
opt-in: heuristic research, trending techniques, community rulesets #1299

Open · jawz101 opened this issue Apr 4, 2017 · 9 comments
Labels: enhancement · heuristic (Badger's core learning-what-to-block functionality) · privacy (General privacy issues; stuff that isn't about Privacy Badger's heuristic)

Comments

jawz101 (Contributor) commented Apr 4, 2017

Something I've thought about; at the encouragement of @ghostwords I at least want to file a ticket to track the discussion. Original mention here: #1244.

  1. community learning: the option to pull down aggregated rule sets.
  2. research into anti-privacy techniques:
     - trending and declining techniques
     - sites with the most tracking
     - real-world data to see whether fingerprinting techniques are actually used
  3. heuristic analysis for developers
  4. additional data could be sent back to the extension UI.
  5. encourage discussion
  6. could compare real-world data against popular blocking lists (EasyList) to see if they add too much overhead.
  7. analyze performance overhead.
  8. reduce the number of reported issues if we can determine which sites people must manually allow to address site breakage. Could contribute to a dynamic exception list.
  9. potentially submit actual blocked content for others to analyze. Say, "mysteryscript.js tracks across a bunch of domains for a bunch of people, what does this thing actually do?"

concerns

  1. clearly state what data would be sent. (Show the data in its current state so the user knows what's up)
  2. potential to disclose private/local servers.
  3. potential to disclose sensitive public servers. Anything public is already public. It would be nice to know it's not associated with me.
  4. trust/good faith if users don't understand the idea. Maybe consider a completely separate add-on, or at least provide a detailed description on an opt-in screen so people can see the benefit.
@ghostwords added the enhancement, heuristic, and privacy labels on Apr 4, 2017
misterHippo commented Aug 2, 2017

+1
Good ideas, especially when expected display/functionality is broken due to false positives.

Cite: AWS Sticky Sessions Issue (Aka: WeatherZone Issue)

jawz101 (Contributor, Author) commented Oct 30, 2017

To address some of the privacy concerns, if there were some sort of effort to do public logging for research, I was thinking about possibly respecting sites' robots.txt files, but that really isn't a standard. Maybe look into how some Internet archival projects decide what to exclude from their archival processes. Maybe query the Wayback Machine to see whether they've archived the site in the past. https://archive.org/help/aboutsearch.htm
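For illustration, here is a minimal sketch of that Wayback Machine check, assuming the public availability API at archive.org/wayback/available and its JSON response shape (nothing in this thread pins those down):

```ts
// Hedged sketch: ask the Wayback Machine whether a site has ever been archived
// before including it in any public research log. Endpoint and response shape
// are assumptions about the public availability API.
async function hasWaybackSnapshot(siteUrl: string): Promise<boolean> {
  const endpoint =
    "https://archive.org/wayback/available?url=" + encodeURIComponent(siteUrl);
  const response = await fetch(endpoint);
  if (!response.ok) {
    return false; // treat lookup failures as "not archived", i.e. don't log the site
  }
  const data = await response.json();
  // "closest" is only present when at least one snapshot exists.
  return Boolean(data?.archived_snapshots?.closest?.available);
}

// Usage: only consider a site for public logging if it has been archived before.
// hasWaybackSnapshot("https://example.com").then((archived) => console.log(archived));
```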

ghostwords (Member) commented:

Previously: #1136.

We are going to go with shipping new Badgers with a freshly pre-trained database (#1891), which tackles some of the same issues community databases address.

Regarding research into what we should be blocking but aren't, our next step might be #2114.

I'm going to close this for now as I don't think it's likely we will work on community databases any time soon, given their complexity and that we are addressing the main pain points already in other ways. Please feel free to open new issues for anything else you think we should consider.

dkg commented Oct 9, 2020

Glad to see the consideration of community rulesets here. It's a hard problem, but definitely worth considering, as Badger Sett can't possibly observe every tracker (it would be sad if Privacy Badger were only useful by default for people who stick to the most popular websites and never venture off them).

Just wanted to note here that any implementation of community rulesets should include some clear policy and mechanism for detecting and preventing attempts to adversarially game the community ruleset (e.g. submitting rival operators' domains or URLs so that their systems malfunction for Privacy Badger users). These defenses aren't likely to be perfect, but they should be strong enough to disincentivize casual attacks like this, and to make it easy to undo the effect of any sophisticated attack that does make it through to users once it is detected.

jawz101 (Contributor, Author) commented Oct 9, 2020

I'd rather have more community heuristics than community lists of domain names: anything that gathers environment or user-behavior data and sends it back to a server.

This might be how a site uses workers, WebSockets, service workers, or requests that establish a persistent connection or run in the background.

And then whether they use web APIs that read the environment (Bluetooth, proximity, ambient light, screen orientation, etc.).
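As a purely hypothetical sketch of what watching one of those APIs could look like (the navigator.bluetooth getter and the message shape are illustrative, not anything Privacy Badger does today):

```ts
// Hedged sketch, injected into the page context: wrap one environment-reading
// API getter and report each access, roughly the way a fingerprinting content
// script instruments JavaScript calls. "navigator.bluetooth" and the
// "env-api-access" message are illustrative assumptions.
const descriptor = Object.getOwnPropertyDescriptor(Navigator.prototype, "bluetooth");

if (descriptor && descriptor.get) {
  const originalGet = descriptor.get;
  Object.defineProperty(Navigator.prototype, "bluetooth", {
    configurable: true,
    get() {
      // Hand the observation to a content script, which could forward it to the
      // background page for the usual "seen on N sites" counting.
      window.postMessage({ type: "env-api-access", api: "navigator.bluetooth" }, "*");
      return originalGet.call(this);
    },
  });
}
```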

A lot of these standards were proposed by Google so that Chrome tablets could run a web browser acting as an operating system. That feels like a stretch for the web.

All I know is that Privacy Badger is a great engine, but it would be nice to give the engine more parts to work from. Cookies are a dying thing, and you can't rely on weighing them as the only measure of a domain.

I mean, fonts.googleapis.com has been among the top 20 or so most common DNS lookups in the world for years, according to Cisco's Umbrella rankings. That isn't just for the sake of giving you free fonts.

dkg commented Oct 9, 2020

@jawz101 I agree that community heuristics are a promising avenue, and I didn't mean to imply that I'm concerned only with community-contributed DNS or URL quasi-blacklists. It's still possible to adversarially game heuristics, for example by identifying a particular unusual behavior of a competitor and submitting it to the community-learning system. That kind of thing would be even harder to detect as an attack than a simple domain name or URL, which is why I want to encourage whoever is thinking about implementing such a system to be aware of the possibilities and think about how to defend against them.

ghostwords (Member) commented:

Thanks all for the suggestions! Looking forward to sharing and discussing a proposal once we have more in place.

bcyphers (Contributor) commented:

@jawz101 That's a cool suggestion, and I do think it would be neat to have something like that down the road. However, I don't think Privacy Badger is set up to handle that kind of thing right now. I hope this doesn't come off as overly negative, and please lmk if you had something else in mind.

From what you described, I'm imagining some kind of plug-in structure, where people can develop and share chunks of code that look for new kinds of tracking action. The problem is, there's no logical place (that I can think of) for a plugin like that to go. Privacy Badger's current heuristics live in various different places in code, and they interact with the rest of the extension in very different ways.

As a simple example, the tracking cookie heuristic is pretty straightforward. It's called from a listener which triggers on both outgoing requests and incoming responses; all it needs are the page URL, request URL, and cookies.
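Roughly the shape being described, as a hedged sketch rather than Badger's actual code (hasTrackingCookies and recordTracking are hypothetical helpers, and this assumes the chrome.webRequest API):

```ts
// Hedged sketch of a request-side cookie check, not Privacy Badger's real code.
// hasTrackingCookies() and recordTracking() are hypothetical helpers.
declare function hasTrackingCookies(cookieHeader: string): boolean;
declare function recordTracking(pageUrl: string, requestUrl: string): void;

chrome.webRequest.onBeforeSendHeaders.addListener(
  (details) => {
    const cookieHeader =
      details.requestHeaders?.find((h) => h.name.toLowerCase() === "cookie")?.value ?? "";
    const pageUrl = details.initiator ?? ""; // origin of the page making the request
    if (cookieHeader && hasTrackingCookies(cookieHeader)) {
      recordTracking(pageUrl, details.url);
    }
  },
  { urls: ["<all_urls>"] },
  // Newer Chrome also needs "extraHeaders" before it exposes Cookie headers here.
  ["requestHeaders", "extraHeaders"]
);

// A matching onHeadersReceived listener would do the same for Set-Cookie on responses.
```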

The fingerprinting heuristic works completely differently, as a content script that's injected right into the main frame context. It instruments certain JavaScript calls, then calls back to Privacy Badger's background process when it sees something. In order to work, it needs contextual information about the page and the calling scripts, but it also has to sit and wait for the JavaScript endpoints to actually be hit. It doesn't fit into the "intercept request -> analyze request -> log request" flow the way the cookie heuristic does.

The supercookie heuristic is another content script, like the fingerprinting heuristic, but it's only injected into iframe contexts. For both, we've set up listeners in Privacy Badger's background thread to handle reports coming from other threads.
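The background side of that report path might look something like this (again a hedged sketch; the message type and noteSupercookie helper are hypothetical):

```ts
// Hedged sketch of a background-page listener receiving reports from content
// scripts, not Privacy Badger's actual message handling.
declare function noteSupercookie(frameUrl: string, tabId: number): void;

chrome.runtime.onMessage.addListener((message, sender) => {
  if (message?.type === "supercookie-detected" && sender.tab?.id !== undefined) {
    // sender.url is the frame the content script ran in (an iframe here).
    noteSupercookie(sender.url ?? "", sender.tab.id);
  }
});
```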

That is to say: thinking about Privacy Badger's heuristics as separate from the rest of the extension is difficult, since so much of Privacy Badger is the heuristics. Each heuristic we've built so far is bespoke: each needs access to different context and different privileges, and each passes information back to the background thread in a different way. Some are synchronous with requests and some aren't. It might be possible to set up a general framework for heuristic modules -- content script, request listener, etc. -- but it would be a much bigger lift than the other improvements we're eyeing right now.

For now, I think the best way for the community to contribute new heuristics is by submitting pull requests.

jawz101 (Contributor, Author) commented Oct 28, 2020

Oh, I didn't take it negatively. Your thoughts make sense to me. Privacy Badger has an engine that was originally purpose-built for training on cookies. It later got some fingerprinting and supercookie checks, but they were bolt-ons.

It makes sense that a modular/plugin architecture to monitor other techniques might need to be a different tool. I just like the basic "if it's on 3 or more sites do something about it" approach.
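For reference, a hedged sketch of that counting idea (the threshold of 3 comes from this thread; the names and data structure are just illustrative):

```ts
// Hedged sketch of the "seen tracking on 3 or more sites" idea, not Badger's
// actual storage code. Maps each third-party base domain to the set of
// first-party sites it was caught tracking on.
const TRACKING_THRESHOLD = 3;
const trackedOn = new Map<string, Set<string>>();

function recordSighting(thirdPartyDomain: string, firstPartySite: string): void {
  const sites = trackedOn.get(thirdPartyDomain) ?? new Set<string>();
  sites.add(firstPartySite);
  trackedOn.set(thirdPartyDomain, sites);
}

function shouldBlock(thirdPartyDomain: string): boolean {
  return (trackedOn.get(thirdPartyDomain)?.size ?? 0) >= TRACKING_THRESHOLD;
}
```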

@ghostwords pinned this issue Dec 9, 2020
@ghostwords unpinned this issue Dec 24, 2023