
Pre-train badger on popular sites #1947

Merged: 34 commits into master on Aug 14, 2018

Conversation

@bcyphers (Contributor) commented Apr 5, 2018

Closes #1891, closes #971 and partially addresses #1019, #1299, #1374.

This PR includes a static JSON file with the snitch_map and action_map resulting from visiting the top 1,000 sites of the Majestic million (in order). On first run, the background page populates the user's action_map and snitch_map with those pre-trained values.
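
Roughly, the first-run seeding works like the following sketch. This is a simplified illustration, not the shipped code: the seed file path, the "badger_initialized" flag, and the function name are assumptions; getBadgerStorageObject() and merge() are the storage calls used elsewhere in this PR.

    // Hypothetical sketch of first-run seeding (not the actual implementation).
    function seedBadgerOnFirstRun(badger) {
      chrome.storage.local.get("badger_initialized", function (items) {
        if (items.badger_initialized) {
          return; // not a fresh install; keep whatever the user has learned
        }
        // seed.json ships with the extension and holds the maps learned by
        // crawling the top 1,000 sites of the Majestic million
        fetch(chrome.runtime.getURL("data/seed.json"))
          .then(function (resp) { return resp.json(); })
          .then(function (seed) {
            badger.storage.getBadgerStorageObject("action_map").merge(seed.action_map);
            badger.storage.getBadgerStorageObject("snitch_map").merge(seed.snitch_map);
            chrome.storage.local.set({ badger_initialized: true });
          });
      });
    }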

It's not opt-in (or even opt-out, for now) because it seems to me that there is no downside to starting off with this training data. We might want to add a "clear data" or "reset Privacy Badger" option so that users or researchers can start with a totally blank slate if they choose, but 99% of users should never have to think about that.

The script for doing the crawling will go in a separate repo. We should probably set it up to run every week or so and set PB to pull down new versions every so often, but for now, we can just update the static file in this repository before new releases.

We probably also need to update parts of the FAQ and the first-run flow to explain some of this, and the parts explaining that "Privacy Badger won't block things right away" can be removed. This PR is not meant to be merged yet.

Please give feedback! Is this the right way to go about the training data? How should we update it? What about the UI should change, if anything?

Edit: The crawling script is here: https://github.com/EFForg/badger-sett

Add a seed.json file, which contains the action_map and snitch_map that
the Badger learns after visiting the top 1000 sites in the Majestic
Million. Add code in background.js to load it on first run. Remove
redundant check for incognito mode in background initialization.
Since Privacy Badger runs in "spanning" incognito mode,
chrome.extension.inIncognitoContext should never return true when called
from the background page. Remove that check.
@bcyphers added the enhancement, important, and heuristic (Badger's core learning-what-to-block functionality) labels on Apr 5, 2018
@bcyphers (Contributor, Author) commented Apr 6, 2018

This broke the assumptions behind a few of the selenium tests -- working on getting those fixed now.

Add resetStoredSiteData function to BadgerPen prototype, which clears
learned action_map and snitch_map. Update selenium tests to call this
where necessary in order to pass.
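
A minimal sketch of what such a reset helper could look like, assuming the learned data lives in the action_map and snitch_map storage objects; the BadgerStorage methods used here (getItemClones, deleteItem) are assumptions for illustration only.

    // Illustrative sketch, not the exact implementation: wipe everything
    // Privacy Badger has learned so tests can start from a blank slate.
    BadgerPen.prototype.resetStoredSiteData = function () {
      var self = this;
      ["action_map", "snitch_map"].forEach(function (name) {
        var store = self.getBadgerStorageObject(name);
        // assumption: getItemClones() returns a copy of all stored entries
        // and deleteItem() removes and persists each one
        Object.keys(store.getItemClones()).forEach(function (domain) {
          store.deleteItem(domain);
        });
      });
    };
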
@ghostwords (Member) commented:

Should we also resolve #971 by exposing a reset option somewhere in the UI (and then have tests click that button instead of running JS)?

Add buttons which (1) reset the tracker lists to their default state and
(2) clear the lists entirely, respectively, to the options page.
@bcyphers (Contributor, Author) commented Apr 11, 2018

It seems easier to have the tests call the JavaScript directly. Clicking the reset button refreshes the page, which requires sleeping for a while to make sure other things don't break. I'll add a separate test in options_test.py to check that the buttons work correctly.

@ghostwords (Member) commented Apr 12, 2018

Where does the crawling script live? Shall we review the script together with this PR?

How will we maintain the seed database to start with? Have the crawling script commit the latest database to its own repository daily, and add a Makefile task to update our definitions from that repository? We should document this step in our release process checklist.

@ghostwords removed the enhancement, heuristic (Badger's core learning-what-to-block functionality), and important labels on Apr 19, 2018
@jawz101 (Contributor) commented Apr 19, 2018

I've been playing around with training mine on a list from Quantcast referenced in #1891.

My thoughts are still to maintain it by wiping the whole thing every time or after a certain timeframe. Otherwise old domains will collect.

I'd also do more than 1,000. If I loaded one page at a time, giving 5 seconds to each page, it could run through 10,000 sites in about half a day. One million sites would take roughly 58 days, so I'm not suggesting that.

I also disabled images, WebGL, WebAssembly, downloadable fonts, WebRTC, service workers, and a few other things to save on bandwidth and computer resources.

After trying 100 or so of the top sites, it still felt like I didn't see as many reds as I expected, maybe a few hundred. I also ran Firefox Lightbeam during it, and that was pretty nasty. It would be interesting to run Lightbeam during the training, get a screenshot of that mess, and then run it again after the training to see if PB makes a difference. And compare total download sizes too.

@andresbase commented:

Current observed behaviour:

  1. User installs Privacy Badger.
  2. The pre-trained list of third parties to block is loaded.
  3. Privacy Badger blocks accordingly and continues to learn.
  4. User clicks on clear all tracker data.
  5. List now shows zero.
  6. User clicks on Reset browsing data.
  7. Pre-trained list is loaded.
  8. User clicks on clear all tracker data.
  9. List now shows zero.
  10. User imports list from old learned data.
  11. User clicks on Reset browsing data.
  12. Old learned data and Pre-trained list are merged.
@andresbase commented:

We should improve the message when the user clicks on "Reset browsing data" to explain that the data to be restored is the pre-built list. It's not exactly clear yet.

@bcyphers (Contributor, Author) commented Apr 20, 2018

The crawling script is here: https://github.com/EFForg/badger-sett

> My thoughts are still to maintain it by wiping the whole thing every time or after a certain timeframe. Otherwise old domains will collect.

@jawz101 The script will run from scratch each time, so every (day|week|however often we run it) the seed data will be completely refreshed. Is this what you mean?

This won't affect users after they install the extension unless they choose to click the "reset" button -- if you install PB now and never reset your tracker list, you might have an out-of-date action map next year. But that's already the way things are.

Also, nice idea with Lightbeam! I want to try that myself and see what happens. It would be really cool to have some visual representations of how much Privacy Badger is doing.

  > 11. User clicks on Reset browsing data.
  > 12. Old learned data and Pre-trained list are merged.

@andresbase This is not the expected behavior. After step 11, only the pre-trained data should be present. Thanks for the catch, I'll look into it. I'll tinker with the notification language as well.

@jawz101 (Contributor) commented Apr 20, 2018

@bcyphers yeah, that's what I mean. Here's an example of 25 sites I'd tested in the past, comparing Firefox's Tracking Protection, Privacy Badger, and uBlock Origin in default and "medium mode." Kinda interesting to play with, but it was alarming to see the visualization. As much as I love the idea of Privacy Badger, it leaves much to be desired. I'm hoping this sort of commit will prime the pump so PB can compete. With uBlock Origin in medium mode (blocking third-party scripts and frames by default) I get nearly complete separation.

I've also drawn up something crude in the past which I thought might be neat.

@andresbase commented:

@jawz101 I think Lightbeam shows all third-parties without taking into account tracking. At least last time I checked I couldn't see tracking in some of the ones shown, but it's worth double checking.

@jawz101 (Contributor) commented Apr 22, 2018

FWIW, I trained mine last night on about 1,300 sites before I stopped, just to see what I could see. I exported the Lightbeam data as well. I didn't take a screenshot of the visualization, but imagine not seeing any of the black background in the Lightbeam screen. It was just the biggest mess in the world. (I had to upload the JSON files as .txt files for them to upload.)

lightbeamData.json.txt
PrivacyBadger_user_data-4_22_2018_12_20_25_AM.json.txt

Add test_reset to options_test.py to test clicking on the 'reset data'
and 'clear all' buttons in the options page.
@bcyphers force-pushed the train-badger-on-popular-sites branch from 6023c18 to b6a2f30 on May 24, 2018 at 22:08
Add calls to `clear_seed_data()` at the start of the new
test_tracking_user_overwrite_* Selenium tests.
    },
    "casalemedia.com": {
      "dnt": false,
      "h
Review comment (Member):

Nit: Could we get rid of trailing spaces?

At E&D suggestion, add a red border to "dangerous" buttons and make the
background red brighter.
@bcyphers (Contributor, Author) commented:

@ghostwords Looks like everywhere we use find_el_by_css now, we expect the element to be visible. I can think of situations when one might want to get an invisible (but present) element, but they don't seem likely for us. I added your suggestion in 8a2c2a9 and it seems to have fixed the issue.

@ghostwords (Member) commented:

Releasing this is blocked by #1972, I think. See #1947 (comment) and EFForg/badger-sett#17.

@andresbase commented Jun 28, 2018

Feedback from several users on this layout:

Split into bullet points:

RESET

Resetting tracking domains will:

  • Delete all data about trackers that Privacy Badger has learned from your browsing
  • Restore the tracking domain list to its default state
    Privacy Badger will use the pre-trained list to block some tracking, but will need to continue learning to be more effective.

REMOVE ALL

Removing all tracking domains will:

  • Delete everything Privacy Badger knows about trackers
  • Cause Privacy Badger to not block anything, until it has had a chance to re-learn from your browsing
Screenshots

screen shot 2018-06-28 at 16 13 13

screen shot 2018-06-28 at 16 13 29

chromedriver is fast enough to sometimes load the options page before
Badger finishes fetching the pre-trained database from disk.
@ghostwords (Member) commented Jun 29, 2018

@andresbase @bcyphers How do these look?

Reset confirmation screenshot

screenshot from 2018-06-29 14 37 35

Remove all confirmation screenshot

screenshot from 2018-06-29 14 38 55

@andresbase commented:

LGTM.
For Reset: should we link later on to the blog post? If so, to keep it updated automatically we could use https://www.eff.org/badger-pretraining, which now points to the repo, but I can change it to point to the post when we publish it.

@ghostwords (Member) commented Jul 1, 2018

Good idea, let's use a redirect URL we control.

@ghostwords (Member) commented:

Updated reset confirmation text:
screenshot from 2018-07-01 16 08 57

@ghostwords (Member) commented Jul 1, 2018

Regarding #1972, another way it crops up is that _recordPrevalence (which we call to record tracking by a domain on a site) sets the domain (and its parent domain) to "allow" when snitch_map has fewer than constants.TRACKING_THRESHOLD entries.

Since snitch_map gets deleted for blocked/cookieblocked domains (because of #1972), when we see a new subdomain of a domain that was already blocked or cookieblocked during pre-training, we end up setting that subdomain and its parent domain back to "allow", which is not at all what we want.
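
Roughly, the logic in question looks like the following simplified sketch. It illustrates the failure mode described above and is not the actual _recordPrevalence implementation; in particular, the parent-domain handling is omitted and constants.ALLOW/constants.BLOCK stand in for the real action bookkeeping.

    // Simplified sketch of the prevalence-recording logic (illustrative only).
    function _recordPrevalence(trackerBase, firstPartyBase) {
      var snitchMap = badger.storage.getBadgerStorageObject("snitch_map"),
        actionMap = badger.storage.getBadgerStorageObject("action_map"),
        sites = snitchMap.getItem(trackerBase) || [];

      if (sites.indexOf(firstPartyBase) === -1) {
        sites.push(firstPartyBase);
        snitchMap.setItem(trackerBase, sites);
      }

      if (sites.length < constants.TRACKING_THRESHOLD) {
        // Problem case: if the tracker's snitch_map entry was wiped (#1972),
        // this branch marks a domain that was blocked during pre-training
        // as "allow" again.
        actionMap.setItem(trackerBase, constants.ALLOW);
      } else {
        actionMap.setItem(trackerBase, constants.BLOCK);
      }
    }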

Conflicts:
	tests/selenium/options_test.py

Conflicts caused by 1f805af.
Don't fall back to boolean testing when a custom tester is present.
Need to reload the test page too when retrying since tabData doesn't get
updated without a reload.
utils.xhrRequest(constants.SEED_DATA_LOCAL_URL, function (err, response) {
  if (!err) {
    var seed = JSON.parse(response);
    self.storage.getBadgerStorageObject("action_map").merge(seed.action_map);
Review comment (Member):

Should we use badger.mergeUserData here instead?

Reply (Contributor, Author):

Maybe -- there will be an extra "version" key in the imported JSON; can the function handle that?

Reply (Contributor, Author):

(yes, it can, and yes, we should.)
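
In that case the first-run seeding could simply hand the parsed seed file to the existing import path, roughly like this sketch (it reuses the names from the excerpt above; the error handling is illustrative):

    utils.xhrRequest(constants.SEED_DATA_LOCAL_URL, function (err, response) {
      if (err) {
        console.error("Could not load seed data:", err);
        return;
      }
      // The seed file has the same shape as exported user data
      // (action_map, snitch_map, plus a "version" key), so mergeUserData
      // can absorb it directly.
      badger.mergeUserData(JSON.parse(response));
    });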

@ghostwords (Member) left a review comment:

Looks good! I'm going to merge so we can get started on updating translations. If we want to do anything about #1947 (comment), let's do a follow-up PR.

@ghostwords merged commit b18da5f into master on Aug 14, 2018
ghostwords added a commit that referenced this pull request Aug 14, 2018
- Use pre-trained tracker data in new Privacy Badger installations.
- Add buttons to reset/clear tracker data.
@ghostwords deleted the train-badger-on-popular-sites branch on August 14, 2018 at 21:09