distinguish POSSE posts vs non-POSSE mentions and handle accordingly #51

Closed
snarfed opened this issue Jan 31, 2014 · 51 comments · Fixed by #465

@snarfed
Owner

snarfed commented Jan 31, 2014

this would be nice for catching when other people post a link to your post in a silo.

i did this for a while in mid 2012, before bridgy's re-release with webmentions. i stopped because the POSSEd posts showed up as comments on the original posts, and i kept that decision in the re-release because i didn't see enough people using rel-syndication links, which meant i couldn't prevent the same thing happening to them.

on the other hand, we've been thinking more about de-duping and similar issues recently, and @tantek proposed that this kind of noise might help motivate people to make their mention handling smarter. worth a thought.

@snarfed
Owner Author

snarfed commented Jan 31, 2014

concretely, these would only differ from current webmentions in that they wouldn't have an in-reply-to, since they truly are "mentions."

@snarfed snarfed changed the title send webmentions for original POSSE silo posts Apr 14, 2014
@snarfed
Owner Author

snarfed commented Apr 14, 2014

two possible approaches for distinguishing the original author's POSSEd posts:

  • don't bother. ideally, webmention handlers would detect them and filter them out, or whatever they want. (@tantek advocates this.)
  • omit original silo posts from the author, but not from other people.

both are reasonable, and this would be a good feature. promoting to now.

@snarfed
Owner Author

snarfed commented Aug 26, 2014

lots of discussion about this on IRC today.

summary: when a tweet links to a post, but isn't the official POSSE tweet of that post, responses are backfed and rendered as if they were responses to the original post. two examples. some people like this somewhat (e.g. @snarfed, @kevinmarks, maybe @kylewm); others don't (@aaronpk, @tantek).

it's hard to prevent this. @tantek correctly notes that we can use rel=me to identify the original author, and only treat their tweets as POSSE candidates. that's a good step.

however, the common case is that the original author later links to their post from a different (non-POSSE) tweet. we could use u-syndication and permashortcitations to distinguish that from the original POSSE tweet, but both of those have low adoption rates among bridgy users, so we'd end up muzzling the majority of responses, which i don't want to do.

@kevinmarks suggests that we use time as a heuristic. if the author links to their post over 24h after it's originally posted, don't consider that a POSSE. definitely a good idea!

(i'd re-emphasize that this is all tradeoffs. given real world usage, i don't see a single best answer so far, and leaving the current behavior is on the table. good to hash through options though!)

@snarfed snarfed changed the title send webmentions for posts as well as responses Aug 26, 2014
@snarfed snarfed added now and removed now labels Aug 26, 2014
@snarfed snarfed removed the later label Sep 4, 2014
@snarfed
Owner Author

snarfed commented Jan 8, 2015

current proposal from @tantek in IRC today: only consider a link to be the original copy if it's on a domain in the user's silo profile. sounds ok to me, we could consider implementing it.

@kylewm
Contributor

kylewm commented Apr 13, 2015

✋ In case it's useful, here's an example where Bridgy is being overly aggressive in assuming a tweet is the POSSE copy of an original.

here's the original: https://adactio.com/journal/8710
here's a tweet from someone else (another bridgy user) linking to the original: https://twitter.com/jgarber/status/587245857034133504

and then a bunch of RT's of that tweet are backfed to the original as if they are RTs of the original. e.g., https://brid-gy.appspot.com/repost/twitter/jgarber/587245857034133504/587680705938907136

@snarfed
Owner Author

snarfed commented Apr 13, 2015

thanks @kylewm!

one way to mitigate: when the post's domain isn't one of the tweet author's domains, demote to u-mention.

@snarfed
Owner Author

snarfed commented Aug 28, 2015

some new thoughts from #452:

here's a concrete example. i recently tweeted this:

My silly privacy antics landed me in a @vice @Motherboard article on prepaid credit cards. Fun, mildly embarrassing. http://motherboard.vice.com/read/the-simple-trick-ashley-madisons-users-could-have-used-to-protect-themselves

with this new feature, we'd attempt to send a webmention with this tweet as the source and the motherboard.vice.com link as the target. of course, the source wouldn't actually be the twitter.com permalink, it'd be the bridgy proxy URL that renders the tweet as mf2.
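A webmention request carries only two form-encoded parameters, `source` and `target`, so the request described above can be sketched as below. This is a minimal sketch, not bridgy's code: the proxy URL path and the `example_user` handle are hypothetical, and endpoint discovery and the HTTP POST itself are left out.

```python
from urllib.parse import parse_qs, urlencode


def webmention_body(source: str, target: str) -> str:
    """Form-encode the two parameters a webmention request carries."""
    return urlencode({"source": source, "target": target})


# Hypothetical proxy URL that would render the tweet as mf2; the real path layout may differ.
source = "https://brid-gy.appspot.com/post/twitter/example_user/1234567890"
# Target taken from the tweet quoted above.
target = (
    "http://motherboard.vice.com/read/"
    "the-simple-trick-ashley-madisons-users-could-have-used-to-protect-themselves"
)

body = webmention_body(source, target)

# Decoding the body gives back the original URLs.
decoded = parse_qs(body)
print(decoded["source"][0])
print(decoded["target"][0])
```

The body would be POSTed to whatever webmention endpoint the target page advertises.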

one interesting question is whether to consider this part of "listen" or "publish." ie should we start doing this when you sign up for backfeed? or only when you enable publish? it's not clear to me which one it belongs to. i'm leaning toward listen (backfeed), but not sure.

also, a catch: POSSE/PESOSed silo posts would end up sending multiple wms, one from the original post and one from each silo post, so the target would end up showing duplicates. bridgy already causes this for POSSEd comments/likes/reposts, though, so it's not a new problem, and we've pretty much agreed that it's the recipient's job to use syndication links, etc to de-dupe.

@snarfed
Owner Author

snarfed commented Aug 28, 2015

an idea for expanding this: search silos for any posts, from anyone, that link to the user's domain(s), and send wms for them too. these are effectively mentions.

silo support for this is mixed:

moved this to #456

@snarfed
Owner Author

snarfed commented Aug 29, 2015

added the full set of OPD heuristics to the IWC wiki. the important part for implementing is:

When considering a backlink in a silo post, use most or all of these heuristics to determine whether it's a POSSE:

  • The backlink must be at or near the end. (Allow e.g. a close paren after the link.)
  • The backlink must point to one of the user's domains, as determined by rel-me and links in their silo profile.
  • The silo post must be published within 24h of the original post.
  • New: compare the silo post's text and the original post's name, summary, and/or content, taking prefixes if they're meaningfully longer. (If the silo post has an ellipsis at or near the end, that's a strong hint to use a prefix.) The edit distance should be below a certain threshold, disregarding common differences like @-usernames in silo posts vs human names in original posts (e.g. this OP vs this POSSE).

current plan is to skip the last one due to complexity. i think the first three get us 80-95% of the value.
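For illustration, the first three heuristics could be combined roughly like this. It is a sketch under stated assumptions, not bridgy's implementation: every function name is hypothetical, "near the end" is read as "only whitespace or a little punctuation follows the link", subdomains count as the user's domain, and the 24h window is applied in either direction.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlparse

POSSE_WINDOW = timedelta(hours=24)
# Assumption: "at or near the end" means only whitespace or closing punctuation follows.
TRAILING_OK = re.compile(r"^[\s)\]\.!…]*$")


def link_near_end(text, link):
    """True if the last occurrence of link is followed only by trivial characters."""
    idx = text.rfind(link)
    if idx < 0:
        return False
    return bool(TRAILING_OK.match(text[idx + len(link):]))


def on_user_domain(link, user_domains):
    """True if the link's host is one of the user's domains or a subdomain of one."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in user_domains)


def within_window(silo_published, original_published):
    """True if the two posts were published within 24h of each other."""
    return abs(silo_published - original_published) <= POSSE_WINDOW


def looks_like_posse(text, link, user_domains, silo_published, original_published):
    """Require all three heuristics; the text-similarity check is skipped."""
    return (
        link_near_end(text, link)
        and on_user_domain(link, user_domains)
        and within_window(silo_published, original_published)
    )


original_time = datetime(2015, 8, 29, 12, 0)
print(looks_like_posse(
    "new post (https://example.com/post)",
    "https://example.com/post",
    {"example.com"},
    original_time + timedelta(hours=1),
    original_time,
))
```

Here `user_domains` is assumed to be a set of lowercase domains already collected from rel-me and the silo profile.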

@snarfed
Owner Author

snarfed commented Sep 1, 2015

reorganizing this slightly. this issue will cover implementing the algorithm above for determining whether a silo post is a POSSE. if it is, we won't send a wm from it to the original post, but we will send its responses. if it isn't a POSSE, we'll send wms to each link in its text (and attachments, etc), as mentions, but we won't send wms for its responses anywhere.
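The two branches of that plan can be sketched as a small routing function. This is only an illustration of the proposal as stated in this comment: `SiloPost`, `plan_webmentions`, and the example URLs are hypothetical, and the POSSE decision is passed in rather than computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SiloPost:
    url: str                                             # URL used as the wm source for this post
    links: List[str] = field(default_factory=list)       # links in its text, attachments, etc.
    responses: List[str] = field(default_factory=list)   # URLs of its replies, likes, reposts


def plan_webmentions(
    post: SiloPost, is_posse: bool, original_url: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return the (source, target) pairs to send for one silo post."""
    if is_posse:
        # POSSE copy: nothing is sent from the post itself,
        # but each response is sent to the original post.
        return [(response, original_url) for response in post.responses]
    # Plain mention: the post is sent to every link it contains,
    # and its responses are not sent anywhere.
    return [(post.url, link) for link in post.links]


post = SiloPost(
    url="https://proxy.example/post/1",
    links=["https://example.com/original"],
    responses=["https://proxy.example/comment/1", "https://proxy.example/like/1"],
)
print(plan_webmentions(post, is_posse=True, original_url="https://example.com/original"))
print(plan_webmentions(post, is_posse=False))
```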

@kylewm @tantek @kevinmarks @aaronpk @kartikprabhu i know this has been controversial for a while now. does that sound like the ideal behavior?

i'm opening a new issue for the feature to search all silo posts for links to users' sites and send mentions for those: #456

@snarfed snarfed changed the title send webmentions for (non-POSSE) posts as well as responses Sep 1, 2015
@snarfed snarfed added the now label Sep 1, 2015
@kevinmarks

Not sure that is ideal - the pattern I get currently is that I quote an old post, my link to it is assumed to be POSSE, and so it isn't shown, but replies are. If it shows my non-POSSE link, the follow-ups are often interesting too, with that context.

@snarfed
Owner Author

snarfed commented Sep 2, 2015

@kevinmarks thanks for reviewing, and good point! ok, so for non-POSSE mentions, we backfeed replies, but not likes or reposts. sound good?

@snarfed
Owner Author

snarfed commented Sep 2, 2015

@kevinmarks on second thought, comparing to pure indieweb behavior...if i include a link in a post, I'd send a mention to it, but i wouldn't also send wms to it for each comment i get on my post, nor would i expect the commenters to send wms directly from their comment posts, since they're not replying to or mentioning that link. so... maybe we shouldn't backfeed replies to mentions after all?

@kylewm
Contributor

kylewm commented Sep 2, 2015

I agree with that last bit -- Instead of backfeeding only the responses to a mention, it should only backfeed the mention itself. Replies to a mention are not replies to the original.

Unfortunately that means it matters even more that Bridgy guess correctly that something is a mention rather than a syndication (or err on the side of assuming syndication unless proven otherwise)... @snarfed in particular often rewords the silo copy so that I don't think edit distance would find them very similar at all, even though all the same information is contained (e.g. https://snarfed.org/2015-08-26_15313).

@armingrewe

Just to confirm, as far as I can tell the Twitter and G+ mentions are now flowing through again. On the blog with the most activity I usually post in my morning (UK, ~6:30 GMT/BST), and the majority of mentions come in over the next few hours. All fine so far.

@snarfed
Owner Author

snarfed commented Sep 14, 2015

thanks for the update @armingrewe! glad to hear it.

btw Facebook should work in general too, but I know you mentioned it hasn't for you. feel free to post details if you want!

@armingrewe

Facebook was fine all the time ;-) There might be something where bridgy isn't picking up something when I post via WordPress, but I need to look at that before I can be sure if there's an issue.

@snarfed
Owner Author

snarfed commented Sep 15, 2015

i've updated the discussion of these OPD heuristics in https://indiewebcamp.com/original-post-discovery#Brainstorming . tldr: there are four, and we've hit real world counterexamples for all of them in bridgy, so none are ideal.

  • user's domain
  • within 24h
  • near the end of the silo post
  • nearly the same text as the silo post, ie edit distance is below a given threshold
@kylewm
Contributor

kylewm commented Sep 15, 2015

few random thoughts...

Another possible heuristic: have we already seen a POSSE for this post on this service? if so, it's more likely that subsequent links are mentions. It's not that strong of a criterion because many people will tweet links to the same piece throughout the day (e.g. Dave Winer), and of course tweets are deleted and reposted as edits.

It's much more costly to incorrectly identify a POSSE copy as a mention, i.e. no backfeed for that post. So the threshold for qualifying as a POSSE copy should probably be way lower, maybe matching some subset of the criteria, like off the top of my head:

* any two of the first three
* any one of the first three + lower than 50% edit distance
* lower than 30% edit distance
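
As a rough sketch, those three rules could be expressed as follows. Assumptions: "N% edit distance" is read as Levenshtein distance divided by the length of the longer text, and all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance as a fraction of the longer text (0.0 means identical)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def qualifies_as_posse(near_end: bool, user_domain: bool, within_24h: bool,
                       distance: float) -> bool:
    """Apply the three alternative rules listed above."""
    hits = sum([near_end, user_domain, within_24h])
    return (
        hits >= 2                            # any two of the first three
        or (hits >= 1 and distance < 0.5)    # any one of them + under 50% distance
        or distance < 0.3                    # under 30% distance on its own
    )


distance = normalized_distance("kitten", "sitting")
print(distance, qualifies_as_posse(True, False, False, distance))
```

Because a missed POSSE costs more than a false one, the rules are ORed together, so any single rule is enough to qualify.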

It's very difficult to correctly categorize the "Kevin tweets a link to his post within 24h" case without throwing out a lot of legitimate POSSEs. In the specific case on the wiki, we could say it looks like he is tweeting at someone but the original isn't in-reply-to anything...wonder if that applies more generally to self-mentions.

@snarfed
Owner Author

snarfed commented Sep 15, 2015

thanks @kylewm! interesting idea to record inferred POSSE links and check them later. kind of an extension of the way we already store syndication links. and you're right, the standard way to handle a complicated inference like this based on heuristics is to combine them with weights into a score... and that in this case, false negatives hurt much more than false positives. (I've always described bridgy as deliberately "promiscuous." :P)

I'm already second guessing all this added complexity, though, and it looks like the domain check is comfortably the strongest so far, so I'm kind of leaning toward just that. meh.

@kylewm
Contributor

kylewm commented Sep 15, 2015

> I'm already second guessing all this added complexity, though, and it looks like the domain check is comfortably the strongest so far, so I'm kind of leaning toward just that. meh.

I would support that too. Fight that sunk cost fallacy!

@ghost

ghost commented Sep 22, 2015

I'm not sure if it's this issue. I came here when searching for the "No post links found" message in this repository. For me Bridgy behaves a bit oddly. I have posted my links as usual to Google+ (manually from my Known instance) and the favorites are fed back to my site as normal, but the replies are not, with the message "No post links found". I checked my Google+ profile and https://stream.tinokremer.nl is mentioned. On my own Known instance, my Google+ profile is mentioned too and IndieAuth sees it as normal.

I'm puzzled why Bridgy cannot see post links, can you shed light on that @snarfed ?

2015-09-22_184303

@snarfed
Owner Author

snarfed commented Sep 23, 2015

@tinokremer sorry for the trouble! you're right, it probably is due to this. current status: trying to track down the memory leak in #456 (comment), which is blocking further fixes here. wish me luck!

@ghost

ghost commented Sep 23, 2015

Memory leaks are the hardest issues to solve, and I'm a C# .Net developer. The reference system and garbage collector clean up most of my mess. Good luck indeed!

snarfed added a commit to snarfed/granary that referenced this issue Sep 23, 2015
snarfed added a commit to snarfed/granary that referenced this issue Sep 24, 2015
matches same kwarg in bridgy's original_post_discovery.discover(). for snarfed/bridgy#51, snarfed/bridgy#485
snarfed added a commit that referenced this issue Sep 24, 2015
…ial ones

uses new include_redirect_sources kwarg in Source.original_post_discovery(). for #51, #485
@snarfed
Owner Author

snarfed commented Sep 26, 2015

tentatively closing. this has been running in prod and stable for a few days. I'm sure there are more bugs left to fix, but we can open new issues for them.

7 participants