
Better datastructure for storing domains #1531

Closed
cowlicks opened this issue Jul 25, 2017 · 4 comments

Comments

@cowlicks
Contributor

We currently use JavaScript objects, like the action_map, or arrays, like the sitesDisabled
or the cookieblock list, to store domains.

But if, for example, we have a domain and want to find whether we have any of its
subdomains, we have to iterate over every key of the object or every element of the
array. This is O(n).

We do this sometimes, like in #1507, or a recent migration, or here, or in #1528.

A better data structure for these would be a prefix tree, where each node is a
DNS "label" (a domain piece between dots). This would let us look up a domain,
or locate the subtree containing all of its subdomains, in time proportional to
the number of labels rather than the number of stored domains. A quick example
that could be a drop-in replacement for the current storage API:

function Tree() {
  this._base = {};
}

Tree.prototype = {
  sentinel: '.',  // because '.' can't appear in a DNS label

  splitter: function (splitme) {
    // Split a domain into labels; reversing puts the TLD first,
    // so subdomains share a common path in the tree.
    return splitme.split('.').reverse();
  },
};

function setItem(tree, item, val) {
  let parts = tree.splitter(item),
    len = parts.length,
    node = tree._base;

  for (let i = 0; i < len; i++) {
    let part = parts[i];
    if (!node.hasOwnProperty(part)) {
      node[part] = {};
    }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

function getItem(tree, item) {
  let parts = tree.splitter(item).concat(tree.sentinel),
    len = parts.length,
    node = tree._base;

  for (let i = 0; i < len; i++) {
    let part = parts[i];
    if (!node.hasOwnProperty(part)) {
      return undefined;
    }
    node = node[part];
  }
  return node;
}
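A short usage sketch of the API above. The definitions are repeated here so the example is self-contained, with one concrete choice of splitter (split on '.' and reverse, so subdomains share a common path):

```javascript
// Compact copies of the Tree/setItem/getItem sketch above,
// with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',  // '.' can't appear in a DNS label
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

function getItem(tree, item) {
  let node = tree._base;
  for (let part of tree.splitter(item).concat(tree.sentinel)) {
    if (!node.hasOwnProperty(part)) { return undefined; }
    node = node[part];
  }
  return node;
}

let tree = new Tree();
setItem(tree, 'example.com', 'block');
setItem(tree, 'cdn.example.com', 'cookieblock');

getItem(tree, 'example.com');      // 'block'
getItem(tree, 'cdn.example.com');  // 'cookieblock'
getItem(tree, 'other.com');        // undefined
```

Note that `cdn.example.com` is stored under the `example.com` node, which is what makes subdomain lookups cheap.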

I think the performance improvements from using this would be minor. The real
gains would come from our ability to easily get subdomains. I think we have
deliberately architected this project around the idea that getting subdomains
from a domain is hard. This has resulted in us using window.getBaseDomain for
everything that goes in storage, like action_map and snitch_map.

This makes it very hard to debug issues, since our information is very limited.
We usually can't get the FQDN of the tracker or of the sites where it was seen
tracking.

With a data structure like this we'd be able to store the FQDN of the tracker
as well as its base domain, and retrieve them efficiently. Having this
information readily available would allow us to fix broken site issues much
more quickly.
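As an illustration of that retrieval, here is a hypothetical `getSubdomains` helper (not part of the API sketched in this issue): it walks to the node for a domain, then collects every FQDN stored underneath it. Compact copies of the tree definitions are included so the sketch is self-contained:

```javascript
// Compact copies of the Tree/setItem sketch, with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

// Hypothetical helper: return every stored domain at or below `domain`.
function getSubdomains(tree, domain) {
  let labels = tree.splitter(domain),
    node = tree._base;
  for (let part of labels) {
    if (!node.hasOwnProperty(part)) { return []; }
    node = node[part];
  }
  let found = [];
  // Depth-first walk, collecting every path that ends in a sentinel.
  (function walk(node, labels) {
    for (let key of Object.keys(node)) {
      if (key === tree.sentinel) {
        found.push(labels.slice().reverse().join('.'));
      } else {
        walk(node[key], labels.concat(key));
      }
    }
  })(node, labels);
  return found;
}

let tree = new Tree();
setItem(tree, 'example.com', 1);
setItem(tree, 'cdn.example.com', 1);
setItem(tree, 'a.cdn.example.com', 1);

getSubdomains(tree, 'example.com');
// -> contains 'example.com', 'cdn.example.com', 'a.cdn.example.com'
```

Cost is proportional to the size of the subtree being collected, not to the total number of stored domains.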

This would also help us know that we are actually fixing the right bugs.

We are currently only able to blindly fix some site bugs, for example in #1493:

  • there is a blocked FQDN that is not tracking on the reported 1st-party origin
  • the blocked domain's FQDN is different from its base domain
  • none of the base domains in the snitch_map have the tracking base domain. I don't even know if the snitch maps are derived from FQDNs.

So there is no obvious way to find where the original tracking actually happened
and reproduce it. I could go wandering around the internet looking for it
myself. I could further interrogate the bug. What usually happens instead is
that something gets added to the cookieblock list. None of these methods are
sustainable.

So I think:

We need a better action_map to fix issues like this.


I'd like to propose another method for the API:

/**
 * Perform a reduction along a branch.
 * Callback looks like:
 * function(node, finish)
 */
function reduce(tree, memo, callback)

This would let us aggregate information along a branch of the tree. For example,
we'd be able to efficiently get all the places an FQDN and all of its parent
domains were seen tracking.
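The proposed signature leaves the branch implicit, so here is a sketch of one possible reading: a `reduceBranch` that also takes the domain naming the branch, and folds the value stored at each node on the path from the root to that domain into the memo. The name `reduceBranch` and the extra `domain` argument are my assumptions, not part of the proposal:

```javascript
// Compact copies of the Tree/setItem sketch, with a concrete splitter.
function Tree() { this._base = {}; }
Tree.prototype = {
  sentinel: '.',
  splitter: function (s) { return s.split('.').reverse(); },
};

function setItem(tree, item, val) {
  let node = tree._base;
  for (let part of tree.splitter(item)) {
    if (!node.hasOwnProperty(part)) { node[part] = {}; }
    node = node[part];
  }
  node[tree.sentinel] = val;
}

// Hypothetical reduction along the branch from the root to `domain`:
// callback(memo, value) is called for every value stored on the path.
function reduceBranch(tree, domain, memo, callback) {
  let node = tree._base;
  for (let part of tree.splitter(domain)) {
    if (node.hasOwnProperty(tree.sentinel)) {
      memo = callback(memo, node[tree.sentinel]);
    }
    if (!node.hasOwnProperty(part)) { return memo; }
    node = node[part];
  }
  if (node.hasOwnProperty(tree.sentinel)) {
    memo = callback(memo, node[tree.sentinel]);
  }
  return memo;
}

// E.g. collect every site where a tracker or any parent domain was seen.
let tree = new Tree();
setItem(tree, 'example.com', ['site-a.net']);
setItem(tree, 'cdn.example.com', ['site-b.net']);

let sites = reduceBranch(tree, 'cdn.example.com', [],
  (memo, val) => memo.concat(val));
// -> ['site-a.net', 'site-b.net']
```

This visits only the nodes on the branch, so the cost scales with the number of labels in the FQDN.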

I think a good place to introduce this data structure is in #1507; if it goes well
we can use it in storage. Related issues: #1289, #1527, #963. This would also allow us
to re-assess #1515.

@cowlicks
Contributor Author

Related to #1299

@ghostwords
Member

There seem to be at least two things going on here:

  • Using a fancier data structure that lets us get subdomains of a domain more efficiently.
  • Having Privacy Badger store more information regarding its decisions to facilitate our debugging.

It's not clear to me the latter depends on the former. We may want to upgrade our data structures at some point, but we should avoid premature optimization.

@ghostwords
Member

Related to #266.

@ghostwords
Member

Can revisit when necessary, closing for now.
