
opt-in: heuristic research, trending techniques, community rulesets #1299

Open · jawz101 opened this issue Apr 4, 2017 · 9 comments
Labels: enhancement · heuristic (Badger's core learning-what-to-block functionality) · privacy (General privacy issues; stuff that isn't about Privacy Badger's heuristic)

Comments

jawz101 (Contributor) commented Apr 4, 2017

Something I've thought about; at the encouragement of @ghostwords I at least want to file a ticket to track the discussion. Original mention here: #1244.

  1. community learning: the option to pull down aggregated rule sets.
  2. research into anti-privacy techniques:
     - trending and declining techniques
     - sites with the most tracking
     - real-world data to see whether fingerprinting techniques are actually used
  3. heuristic analysis for developers
  4. additional data could be sent back to the extension UI.
  5. encourage discussion
  6. could compare real-world data against popular blocking lists (EasyList) to see if they add too much overhead.
  7. analyze performance overhead.
  8. reduce the number of reported issues if we can determine which sites people must manually allow to address site breakage. Could contribute to a dynamic exception list.
  9. potentially submit actual blocked content for others to analyze. Say, "mysteryscript.js tracks across a bunch of domains for a bunch of people, what does this thing actually do?"

concerns

  1. clearly state what data would be sent. (Show the data in its current state so the user knows what's up)
  2. potential to disclose private/local servers.
  3. potential to disclose sensitive public servers. Anything public is already public. It would be nice to know it's not associated with me.
  4. trust/good faith if users don't understand the idea. Maybe consider a completely separate add-on, or at least provide a detailed description on an opt-in screen so people can see the benefit.
@ghostwords added the enhancement, heuristic, and privacy labels on Apr 4, 2017
misterHippo commented Aug 2, 2017

+1
Good ideas, especially when expected display/functionality is broken due to false positives.

Cite: AWS Sticky Sessions Issue (Aka: WeatherZone Issue)

jawz101 (Contributor, Author) commented Oct 30, 2017

To address some of the privacy concerns, if there were some sort of effort to do public logging for research, I was thinking about possibly respecting sites' robots.txt files, but that really isn't a standard. Maybe look into how some Internet archival projects decide what to exclude from their archival processes. Maybe query the Wayback Machine to see whether they've archived the site in the past. https://archive.org/help/aboutsearch.htm
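For illustration, here is a minimal sketch of that Wayback Machine check, assuming the public availability API at archive.org/wayback/available and its JSON response shape (nothing in this thread pins those down):

```ts
// Hedged sketch: ask the Wayback Machine whether a site has ever been archived
// before including it in any public research log. Endpoint and response shape
// are assumptions about the public availability API.
async function hasWaybackSnapshot(siteUrl: string): Promise<boolean> {
  const endpoint =
    "https://archive.org/wayback/available?url=" + encodeURIComponent(siteUrl);
  const response = await fetch(endpoint);
  if (!response.ok) {
    return false; // treat lookup failures as "not archived", i.e. don't log the site
  }
  const data = await response.json();
  // "closest" is only present when at least one snapshot exists.
  return Boolean(data?.archived_snapshots?.closest?.available);
}

// Usage: only consider a site for public logging if it has been archived before.
// hasWaybackSnapshot("https://example.com").then((archived) => console.log(archived));
```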

ghostwords (Member) commented:

Previously: #1136.

We are going to go with shipping new Badgers with a freshly pre-trained database (#1891), which tackles some of the same issues community databases address.

Regarding research into what we should be blocking but aren't, our next step might be #2114.

I'm going to close this for now as I don't think it's likely we will work on community databases any time soon, given their complexity and that we are addressing the main pain points already in other ways. Please feel free to open new issues for anything else you think we should consider.

dkg commented Oct 9, 2020

Glad to see the consideration of community rulesets here. It's a hard problem, but definitely worth considering, as Badger Sett can't possibly observe every tracker (it would be sad if Privacy Badger were only useful by default for people who stick to the most popular websites and never venture off them).

Just wanted to note here that any implementation of community rulesets should include some clear policy and mechanism for detecting and preventing attempts to adversarially game the community ruleset (e.g. submitting rival operators' domains or URLs so that their systems malfunction for Privacy Badger users). These defenses aren't likely to be perfect, but they should be strong enough to disincentivize casual attacks like this, and to make it easy to undo the effect of any sophisticated attack that does make it through to users once it is detected.

jawz101 (Contributor, Author) commented Oct 9, 2020

I'd rather have more community heuristics than community lists of domain names: anything that gathers environment or user-behavior data and sends it back to a server.

This might be how a site uses workers, WebSockets, service workers, or requests that establish a persistent connection or run in the background.

And then whether they use web APIs that read the environment (Bluetooth, proximity, ambient light, screen orientation, etc.).
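As a purely hypothetical sketch of what watching one of those APIs could look like (the navigator.bluetooth getter and the message shape are illustrative, not anything Privacy Badger does today):

```ts
// Hedged sketch, injected into the page context: wrap one environment-reading
// API getter and report each access, roughly the way a fingerprinting content
// script instruments JavaScript calls. "navigator.bluetooth" and the
// "env-api-access" message are illustrative assumptions.
const descriptor = Object.getOwnPropertyDescriptor(Navigator.prototype, "bluetooth");

if (descriptor && descriptor.get) {
  const originalGet = descriptor.get;
  Object.defineProperty(Navigator.prototype, "bluetooth", {
    configurable: true,
    get() {
      // Hand the observation to a content script, which could forward it to the
      // background page for the usual "seen on N sites" counting.
      window.postMessage({ type: "env-api-access", api: "navigator.bluetooth" }, "*");
      return originalGet.call(this);
    },
  });
}
```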

A lot of these standards were proposed by Google so that Chrome tablets could run a web browser acting as an operating system. That feels like a stretch for the web.

All I know is that Privacy Badger is a great engine, but it would be nice to give the engine more parts to work from. Cookies are a dying thing, and you can't rely on weighing them as the only measure of a domain.

I mean, fonts.googleapis.com has been among the top 20 or so most common DNS lookups in the world for years, according to Cisco's Umbrella rankings. That isn't just for the sake of giving you free fonts.

dkg commented Oct 9, 2020

@jawz101 I agree that community heuristics are a promising avenue, and I didn't mean to imply that I'm concerned only with community-contributed DNS or URL quasi-blacklists. It's still possible to adversarially game heuristics, for example by identifying a particular unusual behavior of a competitor and submitting it to the community-learning system. That kind of thing would be even harder to detect as an attack than a simple domain name or URL, which is why I want to encourage whoever is thinking about implementing such a system to be aware of the possibilities and think about how to defend against them.

ghostwords (Member) commented:

Thanks all for the suggestions! Looking forward to sharing and discussing a proposal once we have more in place.

bcyphers (Contributor) commented:

@jawz101 That's a cool suggestion, and I do think it would be neat to have something like that down the road. However, I don't think Privacy Badger is set up to handle that kind of thing right now. I hope this doesn't come off as overly negative, and please lmk if you had something else in mind.

From what you described, I'm imagining some kind of plug-in structure, where people can develop and share chunks of code that look for new kinds of tracking action. The problem is, there's no logical place (that I can think of) for a plugin like that to go. Privacy Badger's current heuristics live in various different places in code, and they interact with the rest of the extension in very different ways.

As a simple example, the tracking cookie heuristic is pretty straightforward. It's called from a listener which triggers on both outgoing requests and incoming responses; all it needs are the page URL, request URL, and cookies.
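Roughly the shape being described, as a hedged sketch rather than Badger's actual code (hasTrackingCookies and recordTracking are hypothetical helpers, and this assumes the chrome.webRequest API):

```ts
// Hedged sketch of a request-side cookie check, not Privacy Badger's real code.
// hasTrackingCookies() and recordTracking() are hypothetical helpers.
declare function hasTrackingCookies(cookieHeader: string): boolean;
declare function recordTracking(pageUrl: string, requestUrl: string): void;

chrome.webRequest.onBeforeSendHeaders.addListener(
  (details) => {
    const cookieHeader =
      details.requestHeaders?.find((h) => h.name.toLowerCase() === "cookie")?.value ?? "";
    const pageUrl = details.initiator ?? ""; // origin of the page making the request
    if (cookieHeader && hasTrackingCookies(cookieHeader)) {
      recordTracking(pageUrl, details.url);
    }
  },
  { urls: ["<all_urls>"] },
  // Newer Chrome also needs "extraHeaders" before it exposes Cookie headers here.
  ["requestHeaders", "extraHeaders"]
);

// A matching onHeadersReceived listener would do the same for Set-Cookie on responses.
```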

The fingerprinting heuristic works completely differently, as a content script that's injected right into the main frame context. It instruments certain JavaScript calls, then calls back to Privacy Badger's background process when it sees something. In order to work, it needs contextual information about the page and the calling scripts, but it also has to sit and wait for the JavaScript endpoints to actually be hit. It doesn't fit into the "intercept request -> analyze request -> log request" flow the way the cookie heuristic does.

The supercookie heuristic is another content script, like the fingerprinting heuristic, but it's only injected into iframe contexts. For both, we've set up listeners in Privacy Badger's background thread to handle reports coming from other threads.
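The background side of that report path might look something like this (again a hedged sketch; the message type and noteSupercookie helper are hypothetical):

```ts
// Hedged sketch of a background-page listener receiving reports from content
// scripts, not Privacy Badger's actual message handling.
declare function noteSupercookie(frameUrl: string, tabId: number): void;

chrome.runtime.onMessage.addListener((message, sender) => {
  if (message?.type === "supercookie-detected" && sender.tab?.id !== undefined) {
    // sender.url is the frame the content script ran in (an iframe here).
    noteSupercookie(sender.url ?? "", sender.tab.id);
  }
});
```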

That is to say: thinking about Privacy Badger's heuristics as separate from the rest of the extension is difficult, since so much of Privacy Badger is the heuristics. Each heuristic we've built so far is bespoke: each needs access to different context and different privileges, and each passes information back to the background thread in a different way. Some are synchronous with requests and some aren't. It might be possible to set up a general framework for heuristic modules -- content script, request listener, etc. -- but it would be a much bigger lift than the other improvements we're eyeing right now.

For now, I think the best way for the community to contribute new heuristics is by submitting pull requests.

jawz101 (Contributor, Author) commented Oct 28, 2020

Oh, I didn't take it negatively. Your thoughts make sense to me. Privacy Badger has an engine that was originally purpose-built for training on cookies. It later got some fingerprinting and supercookie checks, but they were bolt-ons.

It makes sense that a modular/plugin architecture to monitor other techniques might need to be a different tool. I just like the basic "if it's on 3 or more sites do something about it" approach.
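For reference, a hedged sketch of that counting idea (the threshold of 3 comes from this thread; the names and data structure are just illustrative):

```ts
// Hedged sketch of the "seen tracking on 3 or more sites" idea, not Badger's
// actual storage code. Maps each third-party base domain to the set of
// first-party sites it was caught tracking on.
const TRACKING_THRESHOLD = 3;
const trackedOn = new Map<string, Set<string>>();

function recordSighting(thirdPartyDomain: string, firstPartySite: string): void {
  const sites = trackedOn.get(thirdPartyDomain) ?? new Set<string>();
  sites.add(firstPartySite);
  trackedOn.set(thirdPartyDomain, sites);
}

function shouldBlock(thirdPartyDomain: string): boolean {
  return (trackedOn.get(thirdPartyDomain)?.size ?? 0) >= TRACKING_THRESHOLD;
}
```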

@ghostwords pinned this issue Dec 9, 2020
@ghostwords unpinned this issue Dec 24, 2023