Design and implement a fancier heuristicblock data structure #266

Open
pde opened this issue Aug 18, 2014 · 10 comments
Labels
enhancement heuristic Badger's core learning-what-to-block functionality low priority privacy General privacy issues; stuff that isn't about Privacy Badger's heuristic

Comments

@pde
Contributor

pde commented Aug 18, 2014

Currently the httpRequestOriginFrequency data structure needs to store fragments of browsing history locally within the browser, in order to detect which 3rd party domains are tracking. We should migrate to a fancier, higher-privacy data structure that prunes at the right time, uses HMAC fragments instead of origin names, and takes up less space.

@pde
Contributor Author

pde commented Aug 18, 2014

@jcb82 expressed interest in designing this.

@jcb82

jcb82 commented Aug 23, 2014

My understanding of what we want is a client-side data structure that stores counts of pairs (first party, third party), which we'll call (x, y), such that when the same third party y has been detected for three or more different first parties, y is marked as a tracker domain, except that we don't want a fine-grained history to be stored. Is there any more information on the threat model, and exactly how much leakage can we tolerate?

I would propose a two-level Bloom filter: keep N Bloom filters around. When (x, y) is observed, compute H(x || i) for 0 < i < k. For each index i, update Bloom filter i with value y. Once j of the N Bloom filters (for j = 3k - e) have been updated with y, y is considered a tracking domain. We can tune N, k, and j such that we have a high probability of marking tracking domains.

There might be a simpler approach, I'm not totally sure what the design requirements are here.

@cooperq
Contributor

cooperq commented Sep 28, 2016

This was discussed in the bi-weekly developers meeting on 9/27.

Threat model: a local attacker would be able to view a subset of cleared history (only visited domains, not URLs or times).
This solution does not actually help against that, since the attacker could still tell whether the user has visited site n by computing H(n) and searching for it in Privacy Badger's database.
What it does stop: an attacker who can view Privacy Badger's data but doesn't know how to hash strings.

Other option: respect history-clearing events, and remove domains from the snitch map when they are removed from history.
Problem: this will break Privacy Badger if the user consistently clears their entire history.

@pde
Contributor Author

pde commented Dec 15, 2016

We definitely should do this.

@pde
Contributor Author

pde commented Dec 15, 2016

In particular, my instinct is that we should do the following two things:

  • When tracker domains get redlisted, remove the three first party entries that caused them to be redlisted.
  • Maybe replace the first party domain names with a 1-2 hex digit HMAC slice. That's a simple special case of a bloom filter; it will make the algorithm a bit slower to block some trackers, but not terribly so.
@pjlsergeant

pjlsergeant commented Dec 15, 2016

respect history clearing events, and remove domains from snitch map when they are removed from history

I don't think this part is optional; I think this is a hard requirement if you are going to maintain the full domain name.

@pjlsergeant

pjlsergeant commented Dec 15, 2016

Maybe replace the first party domain names with a 1-2 hex digit HMAC slice. That's a simple special case of a bloom filter; it will make the algorithm a bit slower to block some trackers, but not terribly so.

This is a great idea; it gives deniability, you can use a very fast HMAC digest algorithm for it, and the only drawback I see is that once in x times you'll undercount unique domains. This is a winner, unless there's some technical drawback I'm not thinking of.
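For the 1-hex-digit case the undercount rate is easy to estimate, assuming the HMAC slices are uniform over 16 values: the chance that three distinct first parties yield fewer than three distinct slices is 1 - (15/16)(14/16), about 18%, so roughly one tracker in six would need a fourth first party before being flagged:

```javascript
// Chance that 3 distinct first parties map to fewer than 3 distinct
// 1-hex-digit slices (16 buckets), assuming uniform HMAC output.
const buckets = 16;
const pAllDistinct = ((buckets - 1) / buckets) * ((buckets - 2) / buckets);
const pUndercount = 1 - pAllDistinct;
console.log(pUndercount.toFixed(4)); // → 0.1797
```

With 2 hex digits (256 buckets) the same calculation gives about 1.2%, which is why the slice width is a tunable privacy/latency trade-off.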

@pde
Contributor Author

pde commented Dec 15, 2016

I don't think this part is optional; I think this is a hard requirement if you are going to maintain the full domain name.

I agree that leaving cleartext history samples around is unacceptable if the user clears their history. But I've been wondering if 1-hex-digit records are acceptable in that case.

There are some corner cases where a small amount of information about browsing history may be genuinely inferable from those records. For instance, if a third party domain is on a grand total of 5 first party domains across the Web, then thirdparty-example.com : ["5", "A", "C"] probably allows an attacker to infer three visited first parties from the five possibilities. But such cases would be quite rare, and it isn't clear that these records are much worse than having thirdparty-example.com in our blocklist, which is more fundamentally unavoidable.

@pjlsergeant

The corner cases still leave you with a deniability factor that the plaintext domains don't provide. I think the single- (or double-)character hash here is a clear winner.

@ghostwords
Member

I don't see a way to be notified of history-clearing events at this time without asking for the "history" permission, which will trigger a new permission warning, which seems unacceptable for Privacy Badger at this point (will lose a significant percentage of users in Chrome, and have a similar percentage stop upgrading in Firefox).
