Link-rot and Schrödinger's URL

As this website has matured over the years, I regularly encounter bookmarks or links to sites that no longer exist. Motivated by some recent input on the topic (Remy Sharp via Jeremy Keith, an IndieWebCamp session, Jeremy Cherfas), I took a moment to evaluate how I may want to deal with the “link-rot” problem on this website myself.

Reviewing my workflows, I find they fall into two categories:

  1. the archival of resources I link to: ensuring that the inevitable vanishing of web resources won’t leave me unable to access the content (and keeping it available for others, where the content is public), and
  2. how to deal with URLs that no longer work or point to something other than the intended resource (regardless of whether or not I have an archived copy available).

This post documents both my thoughts on my desired solution and some code snippets from my current setup for reference (and re-use by others).

Part 1: Archiving during the authoring process

For quite some time, I have had various safeguards in place to ensure continued access to potentially ephemeral content (essentially any web content not under my control). This is achieved via a page:create hook in the Kirby CMS, which triggers several mechanisms whenever a new bookmark is stored. Unfortunately, this does not yet cover regular links within my texts – that could be achieved with a page:update hook and some suitable vault for the resulting metadata.
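
For orientation, here is a rough sketch of how such a hook can be wired up in Kirby – the hook name follows current Kirby conventions, and the two helper functions are hypothetical stand-ins for the mechanisms described below:

// site/config/config.php – simplified sketch, not my literal setup
return [
    'hooks' => [
        'page.create:after' => function ($page) {
            // only act on newly created bookmark pages
            if ($page->intendedTemplate()->name() !== 'bookmark') {
                return;
            }

            $url = $page->link()->value(); // assumes the bookmarked URL lives in a "link" field

            requestWaybackSnapshot($url, $page); // hypothetical helper, see "Pinging the Wayback Machine"
            storeLocalExcerpt($url, $page);      // hypothetical helper, see "Storing a local copy"
        },
    ],
];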

Pinging the Wayback Machine

First, a call to the Wayback Machine Availability API https://archive.org/wayback/available?url=<URL> checks whether archive.org already has this page saved and whether the timestamp of the JSON response’s closest snapshot is no older than 24 hours, in which case that snapshot’s URL gets stored in my post’s metadata:

try {
    $response = json_decode(file_get_contents('https://archive.org/wayback/available?url=' . $url), true);
} catch (Exception $e) {
    $response = false;
}

if ($response && array_key_exists('closest', $response['archived_snapshots'])) {
    // the snapshot timestamp comes in YYYYMMDDhhmmss format
    $snapshot = $response['archived_snapshots']['closest'];
    if (time() - DateTime::createFromFormat('YmdHis', $snapshot['timestamp'])->getTimestamp() < 86400) {
        $archiveUrl = $snapshot['url'];
    }
}

This check comes first because it is the most straightforward scenario to handle (it returns a plain URL that can be stored right away), but also to avoid swamping the Internet Archive with endless copies of the same content.

If no recent snapshot exists, a call to the API at https://web.archive.org/save (this requires a free API key) returns a job_id, which is stored in a local log file along with the ID of the bookmark page in my system, to later connect the two. (Edit: according to the IndieWeb wiki, the old, simpler API has been working reliably again for quite some time, so an unauthenticated call to https://web.archive.org/save/<URL> will synchronously and almost instantly return the snapshot’s URI, which of course reduces complexity):

try {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "https://web.archive.org/save");
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(['url' => $url]));
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        "Accept: application/json",
        "Content-Type: application/x-www-form-urlencoded;charset=UTF-8",
        "Authorization: LOW {$accesskey}:{$secretkey}",
    ]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    $data  = json_decode($response, true);
    $jobid = $data['job_id'];
} catch (Exception $e) {
    // archive.org API request failed
}
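
For completeness, a minimal sketch of the simpler variant mentioned in the edit above; it assumes that a plain GET request to https://web.archive.org/save/<URL> reports the freshly created snapshot via a Content-Location header (or a redirect), which is how I understand the endpoint to behave:

// unauthenticated variant: request the save URL directly and read the snapshot
// path from the response headers
$ch = curl_init('https://web.archive.org/save/' . $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_HEADER, true);
$response   = curl_exec($ch);
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
curl_close($ch);

if ($response !== false && preg_match('/^Content-Location:\s*(\S+)/mi', substr($response, 0, $headerSize), $match)) {
    $archiveUrl = 'https://web.archive.org' . $match[1];
}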

A cronjob then regularly polls my log of created “save requests”; once a snapshot URL is returned, the bookmark post’s metadata is updated accordingly and the log entry gets deleted:

$response = file_get_contents("http://web.archive.org/save/status/{$job_id}");
$data     = json_decode($response, true);

if ($data['status'] == 'success') {
    $archiveUrl = "https://web.archive.org/web/{$data['timestamp']}/{$data['original_url']}";
}

It is not uncommon for archival requests to time out or return an error; this needs to be handled as well, so as not to clog the cron system:

// the following uses Kirby-specific file helpers, checking the age of the job file in my cron setup
elseif ($data['status'] == 'pending' && F::modified($logfolder . '/' . $job_id) + 43200 < time()) {
    // no result after twelve hours: mark the job as timed out
    F::rename($logfolder . '/' . $job_id, $job_id . '.timeout');
}
// if archive.org returns an error, keep the response for later inspection
elseif ($data['status'] == 'error') {
    F::rename($logfolder . '/' . $job_id, $job_id . '.error');
    F::write($logfolder . '/' . $job_id . '.error', json_encode($data));
}

Storing a local copy

The page:create hook for new bookmarks also makes a call to the Graby library, a tool to extract content from websites. It creates a local excerpt of the bookmarked page’s content area, which I then store in the post’s meta folder. This is an additional local backup for my personal use only (due to copyright, these copies commonly cannot be published); alternatively, it would probably be possible to create a full local copy somehow, or to download one from archive.org.

require __DIR__ . '/vendor/graby/vendor/autoload.php';

$graby  = new Graby\Graby();
$result = $graby->fetchContent($url);

if ($result['status'] == 200 && !empty($result['html']) && $result['html'] != "[unable to retrieve full-text content]") {
    $content = $result['html'];
}

Manual ‘til it hurts: PDFs etc.

If a bookmark contains a PDF resource (e.g. an academic paper) or is mainly about one particular visual (like an infographic, in which case I might take a screenshot for my personal use), I commonly download it and attach it manually to the bookmark post. This allows me to keep a personal copy of such key resources.

Storing the favicon

Purely for presentational reasons, I recently also started to store a copy of the linked website’s favicon. It is an easy way to add a bit of a visual cue to displayed bookmarks, and is actually pretty straightforward, thanks to an open Google API returning the image, which can then be stored as an attachment to the bookmark post:

$domain    = parse_url($url)['host'];
$imageurl  = 'https://www.google.com/s2/favicons?domain=' . $domain;
$imagefile = Remote::get($imageurl);
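
Saving the image as an attachment then only requires writing the response body into the page folder; a minimal sketch (the file name is my choice, and Kirby’s Remote response exposes the body via content()):

// store the fetched icon next to the bookmark's content file
if ($imagefile->code() === 200) {
    F::write($page->root() . '/favicon.png', $imagefile->content());
}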

Bonus: Screenshot for public posts

Currently, I am using a proprietary external service to generate screenshots, but only for bookmarks published in my journal. This is achieved through a Kirby page:publish hook, and the screenshot is displayed at a small size for decorative purposes. Ideally, I would eventually store a full-resolution screenshot of every bookmark for my personal use, to also archive the current appearance of a website (particularly relevant for visual resources or references). Unfortunately, the ubiquitous cookie banners make automated screenshot creation difficult – I often have to manually optimize screenshots (or simply delete them).
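
For reference, a bare-bones sketch of what such a hook-triggered call could look like – the screenshot endpoint and the field names are placeholders, not the actual service I use:

// hypothetical sketch; called from the publish hook with the bookmark page
function storeScreenshot($page, string $serviceUrl, string $apiKey): void
{
    $target  = $page->link()->value(); // assumes the bookmarked URL lives in a "link" field
    $request = $serviceUrl . '?key=' . $apiKey . '&url=' . urlencode($target);
    $image   = Remote::get($request);

    if ($image->code() === 200) {
        F::write($page->root() . '/screenshot.png', $image->content());
    }
}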

Part 2: Dealing with “link-rot” when accessing a page

With my back somewhat covered as I create new entries, the display of outdated URLs remains a problem. Currently, I have no automated process in place, and while automating this sounds appealing, it comes with several challenges:

Error-prone auto-detection

Some brief experimentation with feeding thousands of URLs from my website’s content into Linkchecker immediately led to the discovery of what I herein call “Schrödinger’s URL”: a web resource simultaneously exists and does not exist until the link is actually opened.

Just as there are various reasons for URLs to expire – from moved resources to deleted content or dead websites – there is an abundance of reasons why an automated test may return invalid results:

  • false positives (e.g. HTTP 200 returned, even though the original website is long gone and replaced by something different), or
  • false negatives (e.g. temporarily expired HTTPS certificates, a website that is down for maintenance, a server blocking automated polling or throttling the number of permitted requests, sites discouraging bot visits via robots.txt, …).

Reliably detecting whether a URL truly points to the original resource would require advanced algorithmic processing: likely some kind of evaluation of whether the content still (roughly) matches the original, possibly even a crawl for the new URLs of old content in case a site’s structure has changed, and so on.
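
To illustrate why this is non-trivial, here is a naive sketch (not something I actually run) that compares the currently served page with the locally stored Graby excerpt – the threshold is arbitrary, and the false positives and false negatives listed above would still slip through:

// purely illustrative similarity check
function looksLikeOriginal(string $url, string $storedExcerpt): bool
{
    $current = @file_get_contents($url);
    if ($current === false) {
        return false; // unreachable – possibly just a temporary false negative
    }

    // compare plain text only, truncated to keep similar_text() affordable
    $a = substr(strtolower(strip_tags($current)), 0, 2000);
    $b = substr(strtolower(strip_tags($storedExcerpt)), 0, 2000);
    similar_text($a, $b, $percent);

    return $percent > 60; // arbitrary threshold – picking a reliable one is the hard part
}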

Manual corrections, centralized

In Kirby, content is marked up in Markdown, with the addition of so-called “Kirbytags” – since these are rendered by the CMS (and not by the default Markdown parser), it is possible to override or extend their functionality.

// Markdown
[Link text](https://www.example.com)

// Kirby (additional leading space added to avoid getting rendered here)
( link: https://example.com text: Link text)

For manual overrides, I already extended my link Kirbytag to automatically check the linked URL against a table of known “dead links”, and replace any matches with an alternative URL. This is often an archive.org URL, but may also be any other URL (e.g. if the original author has since changed the structure of their website):

// proto code from a more complex context, to illustrate the principle
if (array_key_exists($url, $arrayOfDeadLinks)) {
    $string = str_replace($url, $arrayOfDeadLinks[$url], $string);
    $string = str_replace('<a', '<a class="archived"', $string);
}

Being an entirely manual approach, this naturally only catches links I have noticed are broken, and it does not scale at all.

Providing options

For the reasons mentioned above, automatically checking whether the targets of links in my content still exist is not the path I want to take. Instead, I believe that aiding users when they encounter a dead link is the better approach: just as Wikipedia encourages, it makes sense to provide users with an additional link to an archived copy stored in my metadata. That way, a link leading to nowhere (or somewhere unexpected) is just a minor speed bump, and users still have a way to find the archive.org copy.

My customized link Kirbytag has long had an additional archive attribute to manually provide an archive.org URL (mostly on particularly important resources).

// this in Kirby… (again, additional whitespace added to avoid rendering)
( link: https://originallink.com text: Visit this site archive: https://web.archive.org/xyz)

// …turns into this output:
<a href="https://originallink.com">Visit this site</a> <a href="https://web.archive.org/xyz">(archived)</a>

Now, based on these most recent considerations, I believe it would be worth extending the automated archival of my bookmarks to any links posted in my texts. The cronjob could then, ideally, either store these original/archive URL pairs in a centralized lookup table, or – the technically more fragile, but in the long term more robust and performant, solution – update the link Kirbytags in my content with the corresponding archive attribute.
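
A rough sketch of the second option, rewriting the Kirbytags in a content file – the regex is simplified and would need hardening, and $pairs stands for the original/archive URL pairs collected by the cronjob:

// add an "archive:" attribute to link tags that do not have one yet
function addArchiveAttributes(string $text, array $pairs): string
{
    return preg_replace_callback('/\(link:\s*(\S+)([^)]*)\)/', function ($match) use ($pairs) {
        $url = $match[1];
        if (isset($pairs[$url]) && strpos($match[2], 'archive:') === false) {
            return '(link: ' . $url . $match[2] . ' archive: ' . $pairs[$url] . ')';
        }
        return $match[0];
    }, $text);
}

Writing the result back would then go through $page->update(), so the change stays consistent with how Kirby manages its content files (the name of the text field depends on the blueprint).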

Maybe the display of the additional archive hyperlink could be timed, so that it only appears on pages authored more than six months ago or so. This would significantly improve legibility for fresher texts and only surface the extra link where it is more likely to be needed, as links tend to fall outdated over time.
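
Inside the link Kirbytag, that would boil down to a simple age check – sketched here with a generic $page standing in for the page the tag is rendered on, assuming it carries a date field:

// only append the extra "(archived)" link on sufficiently old pages
if ($archiveUrl && $page->date()->toDate() < strtotime('-6 months')) {
    $html .= ' <a href="' . $archiveUrl . '" class="archived">(archived)</a>';
}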

Conclusion: Retroactively fixing others’ URLs is hard

The approach outlined – heavily leaning on precautions during initial post creation – obviously only solves these issues for current and future publications. For the link-rot of older URLs, archive.org could be polled systematically for snapshot URLs from roughly the time of publication (either using the timestamp option of the Wayback Machine Availability API or the more advanced CDX API), which could then be fed into a centralized lookup table or written into the text files directly.
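
The Availability API makes the first variant fairly painless, since it accepts a timestamp parameter and returns the snapshot closest to it; a minimal sketch, assuming each post carries a date field:

// look for a snapshot close to the original publication date of the post
$published = date('Ymd', $page->date()->toDate());
$api       = 'https://archive.org/wayback/available?url=' . urlencode($url) . '&timestamp=' . $published;
$data      = json_decode(file_get_contents($api), true);
$snapshot  = $data['archived_snapshots']['closest']['url'] ?? null;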

What I like most about this approach is that the original link structures remain intact, while users have an easy way to both identify the existence of an archived copy and access it – offering choice is generally my preferred solution. After all, the web, and its URLs with it, is a volatile construct by nature; trying to fix that in a one-sided effort feels a bit like a labor of Sisyphus.
