-
-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better datastructure for storing domains #1531
Labels
Comments
Related to #1299 |
This was referenced Jul 27, 2017
Closed
This was referenced Aug 9, 2017
There seem to be at least two things going on here:
It's not clear to me the latter depends on the former. We may want to upgrade our data structures at some point, but we should avoid premature optimization. |
Related to #266. |
Can revisit when necessary, closing for now. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
We currently use javascript objects, like the
action_map
, or arrays like the sitesDisabledor the cookieblock list, to store domains.
But if, for example, we have a domain, and want to find if we have any of its
subdomains, we have to iterate over every key of the object, or element of the
array. This is O(n).
We do this sometimes, like in #1507, or a recent migration, or here, or in #1528.
A better data structure for these would be a prefix tree, where each nodes are
DNS "labels" (domain piece between dots). This would allow us to lookup domains
in O(1), and get all subdomains for a domain in O(1). A quick example that would
be a drop in replacement for the current storage api:
I think the performance improvements from using this would be minor. I think
the real gains will be from our ability to easily get subdomains. I think we
have deliberately architected this project around the idea that getting
subdomains from a domain is hard. This has resulted in use using
window.getBaseDomain
for everything that goes in storage, likeaction_map
and
snitch_map
.This makes it very hard to debug issues, since our information is very limited.
We usually cant' find the get the FQDN of the tracker or the sites where they
were seen tracking.
With a datastructure like this we'd be able to store the FQDN of the tracker,
and the FQDN of the basedomain, and retreive them in an efficient way. Having
this information readily available would allow us to fix broken site issues
much more quickly.
This would also help us know that we are actually fixing the right bugs.
We are currently only blindly able to fix some site bugs, for examlpe in #1493
So there is no obvious way to find where the original tracking actually happend
and reproduce it. I could go wondering around the internet looking for it
myself. I could further interrogate the bug. So what usually happens is something
gets added to the cookieblock list. None of these methods are sustainable.
So I think:
We need to have a better
action_map
to fix issues like thisI'd like to propose anotehr method for the api:
These wolud let us aggregate information along a branch of the tree. So for example we'd
be able get all the places a fqdn and all of its parent domains were seen
tracking efficiently.
I think a good place to introduce this data structure is in #1507, if it goes well
we can use it in storage. Related issues #1289, #1527, #963. This would also allow us
to re-asses #1515
The text was updated successfully, but these errors were encountered: