
Instructions for Lab Members Performing Crawls


This page provides guidance for running and saving data from a crawl of our full 11,708-site dataset. A full crawl consists of crawling crawl-set-pt1.csv through crawl-set-pt8.csv (our crawl set divided into 8 batches), crawling redo-sites.csv (which you create based on those results), and collecting the well-known data with the well-known-collection.py Python script.

A list of things to check before starting a batch of the full crawl:

  1. The correct crawl-set CSV file name is on line 12 of local-crawler.js (i.e., make sure you are crawling the right sites).

  2. The REST API is running with the debug table enabled (a quick sanity check is sketched after this list).

  3. The entries and debug tables are cleared (do this in phpMyAdmin by navigating to the Operations tab and selecting TRUNCATE at the very bottom).

    [Screenshot: truncating the tables from the Operations tab in phpMyAdmin]
  4. The VPN is on and connected to an LA IP address.

  5. The extension's xpi file is updated to the most recent version.

  6. The Mac's display settings ensure that the display won't turn off:

    [Screenshot: macOS display settings with the display set to never turn off]
  7. Everything is cleared from the error-logging folder (this isn't mandatory, but it makes it easier to tell which screenshots are from the current crawl; error-logging.json will be overwritten anyway).
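
Before starting a batch, items 2 and 3 can also be double-checked programmatically. The sketch below (Python, using the requests package) hits the /analysis and /debug endpoints that are used for saving data later; treating their responses as JSON collections of rows is an assumption about the REST API's output, so adjust as needed.

```python
# Pre-crawl sanity check: is the REST API up, and are its tables empty?
# (Minimal sketch; the row-count logic assumes the endpoints return JSON collections.)
import requests

BASE = "http://localhost:8080"

for endpoint in ("analysis", "debug"):
    resp = requests.get(f"{BASE}/{endpoint}", timeout=10)
    resp.raise_for_status()                # fails loudly if the REST API is not running
    rows = resp.json()
    print(f"/{endpoint}: reachable, {len(rows)} row(s) currently stored")
    if rows:
        print(f"  warning: /{endpoint} is not empty -- truncate the table before crawling")
```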

Saving crawl data when crawling our 8-batch dataset:

  1. Create a folder to hold all crawl data. For example, our December crawl data folder was called Crawl_Data_Dec_2023.

  2. Inside that folder, create a subfolder for each batch's data. For example, the first batch's folder should be named pt1.

  3. After a batch finishes, save the analysis data, debugging data, error-logging folder, and terminal output. The following are naming examples for the first batch:

    • Saving analysis data: load http://localhost:8080/analysis, select all data with ctrl-a, and save it to the pt1 folder with ctrl-s. It must be named analysis-pt1.json.
    • Saving debugging data: load http://localhost:8080/debug and save it to the pt1 folder with ctrl-s. For consistency, name it debug-pt1.json.
    • Saving the error-logging folder: leave all file names as they are and copy the error-logging folder into the pt1 folder.
    • Saving terminal output: name it Terminal Saved Output-pt1.txt. We need this to determine the total crawl duration.

    The general folder setup should follow the layout sketched after this list (one subfolder per batch, each holding that batch's analysis, debug, error-logging, and terminal output files).
  4. Upload all crawl data to the Web_Crawler folder in Google Drive, either incrementally as each batch finishes or all at once at the end. Keep the file structure created here.

  5. After the 8 batches are done, the redo sites need to be identified and crawled. (Redo sites are sites that had an error and a subdomain; we redo those sites without the subdomain.) The following colab will identify those sites (modify path1 to be the path to the new data); a rough sketch of this step also appears after this list. Two files will be created: redo-sites.csv and redo-original-sites.csv. The first contains the sites you will crawl again (i.e., without the subdomains), and the second contains the same sites with their subdomains (used to identify the exact sites later). Put both files into the equivalent of Crawl_Data_Dec_2023 in the drive.

  6. Crawl redo-sites.csv and save the usual data, substituting redo for pt1 in folder and file names. The redo data will override the initially collected data in the analysis colab.

  7. Run the well-known-collection.py script as described in the readme to collect the well-known data.

  8. Create a well-known folder in the equivalent of Crawl_Data_Dec_2023 in the drive. Put the well-known-data.csv and well-known-errors.json output files into this folder. Save the terminal output as Terminal Saved Output-wellknown.txt and put it in the same folder.

    When in doubt, follow the file structure from previous crawls. All data and analysis code can be found here in the drive.
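
For reference, the per-batch layout described above (using the December 2023 names; substitute your crawl's month) looks like this:

```
Crawl_Data_Dec_2023/
├── pt1/
│   ├── analysis-pt1.json
│   ├── debug-pt1.json
│   ├── error-logging/
│   └── Terminal Saved Output-pt1.txt
├── pt2/ ... pt8/        (same structure as pt1)
├── redo/                (same structure, with redo in place of pt1)
└── well-known/
    ├── well-known-data.csv
    ├── well-known-errors.json
    └── Terminal Saved Output-wellknown.txt
```

The redo-site selection in step 5 is done by the colab, but the underlying rule is simple: a site is redone if it both produced an error and has a subdomain, and it is re-crawled with the subdomain stripped. Below is a minimal sketch of that rule in Python. The error-log structure (a JSON object keyed by the errored site's URL), the output format (one URL per row, no header), and the tldextract dependency are all assumptions for illustration; the colab is the authoritative implementation.

```python
# Illustrative sketch of redo-site selection; the colab is the authoritative version.
import csv
import json
import tldextract  # third-party package for splitting a URL into subdomain / domain / suffix

def find_redo_sites(error_log_paths,
                    out_redo="redo-sites.csv",
                    out_original="redo-original-sites.csv"):
    originals, redos = [], []
    for path in error_log_paths:
        with open(path) as f:
            errors = json.load(f)              # assumed: JSON object keyed by the errored site's URL
        for site in errors:
            parts = tldextract.extract(site)
            if parts.subdomain:                # only errored sites WITH a subdomain are redone
                originals.append(site)
                redos.append(f"https://{parts.domain}.{parts.suffix}")  # same site, subdomain stripped

    for out_path, rows in ((out_original, originals), (out_redo, redos)):
        with open(out_path, "w", newline="") as f:
            csv.writer(f).writerows([row] for row in rows)

# One error log per batch folder, as saved in step 3 above.
find_redo_sites([f"pt{i}/error-logging/error-logging.json" for i in range(1, 9)])
```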

Parsing/analyzing crawl data:

After the full crawl is done and the data is saved in the correct format, parse the data using this colab. The parsed data will appear in this Google Sheet. Graphs for that month can be created by running this colab, and graphs comparing data from multiple crawls can be created using this colab. Figures are automatically saved to this folder. This colab serves as a library for the other colabs.

Google Drive Web_Crawler directories and files:

  • crawl-set-pt1.csv -- crawl-set-pt8.csv: The crawl set divided into 8 batches.
  • Crawl_Data_Month_Year (e.g., Crawl_Data_April_2024): Folders with the results of our past crawls.
  • Crawl_Data: A file that compiles all the crawl data accumulated over the series of crawls (a compiled version of the Crawl_Data_Month_Year folders).
  • sites_with_GPP: A file that collates all the sites with GPP (as of December 2023); this analysis is now reflected as a statistics figure in Processing_Analysis_Data.
  • Ad_Network_Analysis: A file with the results of the manual analysis of up to 70 ad networks' privacy policies.
  • Web_Crawl_Domains_2023: A file collating detailed information about the sites in our crawl set (i.e., their ad networks, contact information, and Tranco ranks).
  • Collecting_Sites_To_Crawl: A folder with files that explain and justify our methodology and process for collecting the sites to crawl (ReadMe and Methodology).
  • similarweb: A folder with our analysis that processes the SimilarWeb data and determines which Tranco ranks have sufficient traffic to be subject to the CCPA.
  • GPC_Detection_Performance: A folder of ground truth data collected on validation sets of sites for verifying the USPS and GPP strings via the USPAPI value, the OptanonConsent cookie, and the GPP string value, each before and after sending a GPC signal.
  • Processing_Analysis_Data: A folder with all the colabs for parsing, processing, and analyzing the crawl results, and the figures created from the analysis.

What to do if a crawl fails in the middle:

If a crawl fails in the middle of a batch (i.e., completely stops running), restart it from where it left off. To do so, change the loop in local-crawler.js from for (let site_id in sites) { to for (let site_id = x; site_id < sites.length; site_id++) {, replacing x with the last crawled site_id plus 1. You can determine the last crawled site_id by looking in the analysis database and starting after that one. Before you start crawling again, rename the existing error-logging.json to any other file name; otherwise, you will lose all errors recorded up to the point the crawl failed, and these errors are necessary for parsing in the colabs. (error-logging.json is overwritten with the error object stored in local-crawler.js every time there is an error.) After the whole batch has successfully completed, manually merge the error JSON files so that all errors for the batch are in error-logging.json; a sketch of this merge is below.
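
A minimal sketch of that merge, assuming each error-logging file holds a single JSON object keyed by site (the exact structure comes from the error object in local-crawler.js, so adjust accordingly; the renamed file name below is just an example):

```python
# Merge the pre-failure error log (renamed before the restart) with the log from
# the restarted run, so the whole batch's errors end up in one error-logging.json.
import json

def merge_error_logs(paths, out_path="error-logging.json"):
    merged = {}
    for path in paths:
        with open(path) as f:
            merged.update(json.load(f))    # assumed: each file is one JSON object keyed by site
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)

merge_error_logs(["error-logging-before-failure.json", "error-logging.json"])
```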

General info about data analysis in the colabs:

GPP String decoding:

The GPP String encoding/decoding process is described by the IAB here. The IAB has a website to decode and encode GPP strings. This is helpful for spot checking and is the quickest way to encode/decode single GPP strings. They also have a JS library to encode and decode GPP strings on websites. Because we cannot directly use this library to decode GPP strings in Python, we converted the JS library to Python and use that for decoding (Python library found here). The Python library will need to be updated when the IAB adds more sections to the GPP string. More information on updating the Python library and why we use it can be found in issue 89. GPP strings are automatically decoded using the Python library in the colabs.
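
For quick spot checks outside the colabs, the GPP header can also be inspected with a few lines of Python. The sketch below only reads the header's 6-bit type and version fields and splits the string into its "~"-separated sections, based on the IAB's published format (type is fixed to 3 for GPP); it does not decode the fibonacci-encoded section IDs or the section payloads, so it is not a substitute for the converted Python library.

```python
# Simplified GPP header spot check (illustrative only; use the converted Python
# library for real decoding).
B64URL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def gpp_header_info(gpp_string):
    sections = gpp_string.split("~")                   # header first, then one chunk per section
    bits = "".join(format(B64URL.index(c), "06b") for c in sections[0])
    header_type = int(bits[0:6], 2)                    # 3 identifies a GPP string
    version = int(bits[6:12], 2)
    return header_type, version, len(sections) - 1

# Any GPP string works here; this one follows the form of the IAB's examples.
t, v, n = gpp_header_info("DBABM~CPXxRfAPXxRfAAfKABENB-CgAAAAAAAAAAYgAAAAAAAA")
print(f"type={t} (expect 3), version={v}, sections={n}")
```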