Notes on responsible web crawling

Published under the Coding category.

In my blog post brainstorming a new indie web search engine, I noted that running a web search engine is hard. With that in mind, I realised I haven't written much about what I learned about web crawling while running IndieWeb Search, a search engine for the indie web. IndieWeb Search crawled a whitelist of websites, discovering pages and indexing them for use in the search engine.

One challenge in particular when running a search engine is ensuring that it doesn't break someone's website. For example, suppose you are indexing a personal website. If your URL parsing logic is incorrect, your crawler may spiral out of control and start crawling many pages that don't exist. A side-effect of this is that you put undue load on someone's server, making their site slower.

This is why I outlined an indie web search engine that works by reading feeds rather than crawling (also called “spidering”) websites.

Crawlers go from page to page looking for new links. Each page is retrieved and indexed. This is repeated until either a whole site has been indexed or a crawl budget has been met.
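As a very rough sketch (not the exact logic IndieWeb Search used), here is what that loop might look like in Python. fetch_page and extract_links are hypothetical helpers standing in for your own HTTP and HTML-parsing code, and crawl_budget caps how many pages are indexed:

    from collections import deque

    def crawl(seed_url, fetch_page, extract_links, crawl_budget=1000):
        """Breadth-first crawl: fetch a page, index it, queue its links,
        and stop once the site is exhausted or the budget is met."""
        queue = deque([seed_url])
        seen = {seed_url}
        indexed = []

        while queue and len(indexed) < crawl_budget:
            url = queue.popleft()
            page = fetch_page(url)                # hypothetical helper: HTTP GET
            if page is None:
                continue
            indexed.append(url)                   # a real engine would index the page content here
            for link in extract_links(page):      # hypothetical helper: find <a href> targets
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

        return indexed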

As soon as you get into crawling, there are many technical considerations involved in building a responsible web crawler.

With that said, I wanted to document what I learned in building a web search engine, and some of the ideas I have had since then pertaining to responsible web crawling. Below are some of the things you should do to ensure that your web crawler is responsible. There are likely other considerations that apply at scales different from the one at which I was working (~1,000 websites, ~500,000 pages).

URL canonicalisation

Ensure you have strong URL canonicalisation logic. This logic should take any URL on a site and normalise it into a standard form. For example, all the following URLs are valid, but equivalent when the domain being crawled is https://jamesg.blog:

  • https://jamesg.blog
  • https://jamesg.blog/
  • /

If you crawled all three of these URLs, you would have crawled three pages when you only needed to crawl one. This gets more complicated if a site has URL parameters that may or may not change the page substantially. If I recall correctly, I decided to strip all URL parameters; a large-scale search engine should instead identify which parameters matter and decide whether, and at what rate, to crawl URLs containing them. That is out of scope for this guide.

You should have well-tested logic that ensures URLs are canonicalised properly, so that technically equivalent URLs are consolidated. When you discover a new URL, canonicalise it to check whether you have already crawled it; if you haven't, add it to your crawl queue. Of note, this is not related to rel=canonical: rel=canonical is a page stating which URL is canonical, and that is a consideration you only arrive at once you have requested a page.

With poor URL canonicalisation, you may end up crawling substantially more pages than you need to.
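As a sketch of the kind of canonicalisation logic I mean, the function below uses Python's urllib.parse to resolve relative URLs, lower-case the hostname, collapse trailing slashes, and strip parameters and fragments (which, as noted above, a larger engine should not do blindly):

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def canonicalise(url, base="https://jamesg.blog"):
        """Normalise a discovered URL into one standard form so that
        equivalent URLs are only crawled once."""
        absolute = urljoin(base, url)                  # resolves relative URLs like "/"
        scheme, netloc, path, _query, _fragment = urlsplit(absolute)
        netloc = netloc.lower()
        path = path.rstrip("/") or "/"                 # treat "/about/" and "/about" as the same page
        # Strip parameters and fragments entirely; a larger engine would
        # decide per site which parameters actually change the page.
        return urlunsplit((scheme, netloc, path, "", ""))

With this in place, https://jamesg.blog, https://jamesg.blog/, and / all canonicalise to the same string, so the page is only queued once.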

Redirects

Your search engine should have robust measures in place to manage redirects. Limit the number of redirects any URL can trigger. If you have a whitelist, don't crawl any URL whose hostname is not on your list. You should use a pre-existing library to check whether hostnames match; in general, lean on what others have written when building your crawler.

For example, suppose a site is misconfigured so that /example/// (three trailing slashes) redirects to /example// (two trailing slashes), which in turn redirects back to the three-slash URL. Your crawler should stop traversing these redirects and move on.
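One way to do this is to follow redirects by hand, so you can cap the number of hops and refuse to leave your whitelist. The sketch below uses the requests library; the simple set lookup stands in for a proper hostname-matching library, and ALLOWED_HOSTS and MAX_REDIRECTS are values you would choose yourself:

    import requests
    from urllib.parse import urljoin, urlsplit

    ALLOWED_HOSTS = {"jamesg.blog"}   # your whitelist of allowed hostnames
    MAX_REDIRECTS = 5                 # give up after a handful of hops

    def fetch_with_redirect_limit(url, timeout=10):
        """Follow redirects manually, capping the number of hops and
        refusing to follow redirects off the whitelist."""
        for _ in range(MAX_REDIRECTS):
            if urlsplit(url).hostname not in ALLOWED_HOSTS:
                return None                                   # redirected off the whitelist
            response = requests.get(url, timeout=timeout, allow_redirects=False)
            if response.status_code not in (301, 302, 303, 307, 308):
                return response
            url = urljoin(url, response.headers.get("Location", ""))
        return None                                           # redirect loop or chain too long; move on

The misconfigured /example/// loop above would bounce between the two URLs a few times and then be abandoned.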

More tips

When you crawl a site, you should:

  • Respect robots.txt. There are many parsers available that let you check whether a URL is covered by a robots.txt policy, given the user agent under which your search engine operates (there is a short robots.txt and 429-handling sketch after this list). Related: declare a user agent publicly and provide guidance on how people can limit or restrict crawls from your search engine. You are not a responsible search engine if you don't give people a clear way to limit crawling short of blocking your search engine outright.
  • Respect the Retry-After header that states you should retry crawling a page after a certain period.
  • Apply a timeout when you crawl URLs.
  • Acknowledge 429s and make sure you update your crawl queue to prioritize other URLs that have not returned a 429.
  • Look out for high rates of 500s. These may indicate that a site is under stress or has other technical issues. 500 responses are not useful for indexing, so you should back off and try again later.
  • Crawl multiple sites at once, rather than crawling entire sites sequentially. If a site has 100,000 valid pages, you don’t want to allocate all of your crawl capacity to that site all at once. Instead, you should crawl multiple sites at the same time. This will reduce the risk of running into 429s or causing problems.
  • Have per-site crawl budgets. This could vary depending on the site. If you are making a small search engine, you may only choose to crawl 1,000 URLs from a site; for a larger search engine, this number may increase substantially.
  • Use Last-Modified headers to check whether a page has changed since your last request.
  • If a server is slowing down as you crawl, but no 429s are advertised, consider moving URLs from that site further back in your queue.
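To illustrate a few of these points, here is a sketch of a robots.txt check and a polite fetch that applies a timeout and honours 429s. The user agent string is a made-up example, and the Retry-After handling assumes a value in seconds (the header can also be an HTTP date):

    import requests
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-search-bot"   # hypothetical; declare your real user agent publicly

    def allowed_by_robots(url, robots_url):
        """Check a URL against the site's robots.txt for our user agent."""
        parser = RobotFileParser(robots_url)
        parser.read()
        return parser.can_fetch(USER_AGENT, url)

    def fetch_politely(url, timeout=10):
        """Fetch a URL with a timeout; on a 429, return how long to wait
        so the caller can push the site's URLs back in the queue."""
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", "60"))  # assumes seconds
            return None, retry_after
        return response, 0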

These are some of the many considerations you should take into account when building a search engine that spiders between URLs.

If you don’t crawl websites, and instead only download the content of feeds, the above considerations are less significant. This is why I outlined such an approach in a brainstorm for a new indie web search engine. If you are crawling 1,000 feeds, and you only download the feed URLs rather than all the posts in each feed individually, there is substantially less risk of bringing down someone’s site than if you are downloading thousands of URLs from their site. The considerations above are still useful if you decide to download any pages linked in a feed, though.
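As a rough sketch of that feed-reading approach, using the feedparser library (the feed URL below is just an illustrative example), you make one request per site and index what the feed already contains:

    import feedparser

    FEED_URLS = [
        "https://example.com/feed.xml",   # illustrative; in practice, ~1,000 feed URLs
    ]

    def index_feeds(feed_urls):
        """Index posts from their feeds alone: one request per site,
        rather than one request per post."""
        documents = []
        for feed_url in feed_urls:
            feed = feedparser.parse(feed_url)            # a single HTTP request per feed
            for entry in feed.entries:
                documents.append({
                    "url": entry.get("link"),
                    "title": entry.get("title"),
                    "content": entry.get("summary", ""),
                })
        return documents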

If you have implemented a search crawler and have written advice on this topic, let me know by email. I can add a link to your post here.
