
Better datastructure for storing domains #1531

Closed
cowlicks opened this issue Jul 25, 2017 · 4 comments

Comments

@cowlicks
Contributor

We currently use JavaScript objects, like the action_map, or arrays, like the sitesDisabled
or the cookieblock list, to store domains.

But if, for example, we have a domain and want to find whether we have any of its
subdomains, we have to iterate over every key of the object or every element of the
array. This is O(n).

We do this sometimes, like in #1507, or a recent migration, or here, or in #1528.

A better data structure for these would be a prefix tree, where each node is a
DNS "label" (a domain piece between dots). This would let us look up a domain,
or locate the subtree containing all of its subdomains, in time proportional to
the number of labels rather than the number of stored domains. A quick example
that could be a drop-in replacement for the current storage API:

function Tree() {
  this._base = {};
}

Tree.prototype = {
  sentinel: '.',  // because '.' can't appear in a DNS label

  splitter: function (splitme) {
    // Split a domain into labels; reversing puts the TLD first,
    // so subdomains share a common path in the tree.
    return splitme.split('.').reverse();
  },
};

function setItem(tree, item, val) {
  let parts = tree.splitter(item),
    len = parts.length,
    node = tree._base;

  for (let i = 0; i < len; i++) {
    let part = parts[i];
    if (!node.hasOwnProperty(part)) {
      node[part] = {};
    }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

function getItem(tree, item) {
  let parts = tree.splitter(item).concat(tree.sentinel),
    len = parts.length,
    node = tree._base;

  for (let i = 0; i < len; i++) {
    let part = parts[i];
    if (!node.hasOwnProperty(part)) {
      return undefined;
    }
    node = node[part];
  }
  return node;
}
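A short usage sketch of the API above. The definitions are repeated here so the example is self-contained, with one concrete choice of splitter (split on '.' and reverse, so subdomains share a common path):

```javascript
// Compact copies of the Tree/setItem/getItem sketch above,
// with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',  // '.' can't appear in a DNS label
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

function getItem(tree, item) {
  let node = tree._base;
  for (let part of tree.splitter(item).concat(tree.sentinel)) {
    if (!node.hasOwnProperty(part)) { return undefined; }
    node = node[part];
  }
  return node;
}

let tree = new Tree();
setItem(tree, 'example.com', 'block');
setItem(tree, 'cdn.example.com', 'cookieblock');

getItem(tree, 'example.com');      // 'block'
getItem(tree, 'cdn.example.com');  // 'cookieblock'
getItem(tree, 'other.com');        // undefined
```

Note that `cdn.example.com` is stored under the `example.com` node, which is what makes subdomain lookups cheap.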

I think the performance improvements from using this would be minor. The real
gains would come from our ability to easily get subdomains. I think we have
deliberately architected this project around the idea that getting subdomains
from a domain is hard. This has resulted in us using window.getBaseDomain for
everything that goes in storage, like action_map and snitch_map.

This makes it very hard to debug issues, since our information is very limited.
We usually can't get the FQDN of the tracker or of the sites where it was seen
tracking.

With a data structure like this we'd be able to store the FQDN of the tracker
as well as its base domain, and retrieve them efficiently. Having this
information readily available would allow us to fix broken site issues much
more quickly.
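As an illustration of that retrieval, here is a hypothetical `getSubdomains` helper (not part of the API sketched in this issue): it walks to the node for a domain, then collects every FQDN stored underneath it. Compact copies of the tree definitions are included so the sketch is self-contained:

```javascript
// Compact copies of the Tree/setItem sketch, with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

// Hypothetical helper: return every stored domain at or below `domain`.
function getSubdomains(tree, domain) {
  let labels = tree.splitter(domain),
    node = tree._base;
  for (let part of labels) {
    if (!node.hasOwnProperty(part)) { return []; }
    node = node[part];
  }
  let found = [];
  // Depth-first walk, collecting every path that ends in a sentinel.
  (function walk(node, labels) {
    for (let key of Object.keys(node)) {
      if (key === tree.sentinel) {
        found.push(labels.slice().reverse().join('.'));
      } else {
        walk(node[key], labels.concat(key));
      }
    }
  })(node, labels);
  return found;
}

let tree = new Tree();
setItem(tree, 'example.com', 1);
setItem(tree, 'cdn.example.com', 1);
setItem(tree, 'a.cdn.example.com', 1);

getSubdomains(tree, 'example.com');
// -> contains 'example.com', 'cdn.example.com', 'a.cdn.example.com'
```

Cost is proportional to the size of the subtree being collected, not to the total number of stored domains.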

This would also help us know that we are actually fixing the right bugs.

We are currently only able to blindly fix some site bugs, for example in #1493:

  • there is a blocked FQDN that is not tracking on the reported 1st-party origin
  • the blocked domain's FQDN is different from its base domain
  • none of the base domains in the snitch_map have the tracking base domain. I don't even know if the snitch maps are derived from FQDNs.

So there is no obvious way to find where the original tracking actually happened
and reproduce it. I could go wandering around the internet looking for it
myself. I could further interrogate the bug. What usually happens instead is
that something gets added to the cookieblock list. None of these methods are
sustainable.

So I think:

We need a better action_map to fix issues like this.


I'd like to propose another method for the API:

/**
 * Perform a reduction along a branch.
 * Callback looks like:
 * function(node, finish)
 */
function reduce(tree, memo, callback)

This would let us aggregate information along a branch of the tree. For example,
we'd be able to efficiently get all the places an FQDN and all of its parent
domains were seen tracking.
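The proposed signature leaves the branch implicit, so here is a sketch of one possible reading: a `reduceBranch` that also takes the domain naming the branch, and folds the value stored at each node on the path from the root to that domain into the memo. The name `reduceBranch` and the extra `domain` argument are my assumptions, not part of the proposal:

```javascript
// Compact copies of the Tree/setItem sketch, with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

// Hypothetical reduction along the branch from the root to `domain`:
// callback(memo, value) is called for every value stored on the path.
function reduceBranch(tree, domain, memo, callback) {
  let node = tree._base;
  for (let part of tree.splitter(domain)) {
    if (node.hasOwnProperty(tree.sentinel)) {
      memo = callback(memo, node[tree.sentinel]);
    }
    if (!node.hasOwnProperty(part)) { return memo; }
    node = node[part];
  }
  if (node.hasOwnProperty(tree.sentinel)) {
    memo = callback(memo, node[tree.sentinel]);
  }
  return memo;
}

// E.g. collect every site where a tracker or any parent domain was seen.
let tree = new Tree();
setItem(tree, 'example.com', ['site-a.net']);
setItem(tree, 'cdn.example.com', ['site-b.net']);

let sites = reduceBranch(tree, 'cdn.example.com', [],
  (memo, val) => memo.concat(val));
// -> ['site-a.net', 'site-b.net']
```

This visits only the nodes on the branch, so the cost scales with the number of labels in the FQDN.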

I think a good place to introduce this data structure is in #1507; if it goes well
we can use it in storage. Related issues: #1289, #1527, #963. This would also allow us
to re-assess #1515.

@cowlicks
Contributor Author

Related to #1299

@ghostwords
Member

There seem to be at least two things going on here:

  • Using a fancier data structure that lets us get subdomains of a domain more efficiently.
  • Having Privacy Badger store more information regarding its decisions to facilitate our debugging.

It's not clear to me the latter depends on the former. We may want to upgrade our data structures at some point, but we should avoid premature optimization.

@ghostwords
Member

Related to #266.

@ghostwords
Member

Can revisit when necessary, closing for now.
