When the Internet Archive Forgets

On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material?

A few weeks ago, while recording my podcast, the topic turned to the old blog written by The Ultimate Warrior, the late bodybuilder turned chiropractic student turned pro wrestler turned ranting conservative political speaker under his legal name of, yes, “Warrior.” As described by Deadspin’s Barry Petchesky in the aftermath of Warrior’s 2014 passing, he was “an insane dick,” spouting off in blogs and campus speeches about people with disabilities, gay people, New Orleans residents, and many others. But when I went looking for a specific blog post, I saw that the blogs were not just removed, the site itself was no longer in the Internet Archive, replaced by the error message: “This URL has been excluded from the Wayback Machine.”

Apparently, Warrior’s site had been de-archived for months, not long after Rob Rousseau pored over it for a Vice Sports article on the hypocrisy of WWE using Warrior’s image for their Breast Cancer Awareness Month campaign. The campaign was all about getting women to “Unleash Your Warrior,” complete with an Ultimate Warrior motif, but since Warrior’s blogs included wishing death on a cancer survivor, this wasn’t a good look. Rousseau was struck by how the archive was removed “almost immediately after my piece went up, like within that week,” he told Gizmodo.

Rousseau suspected that WWE was somehow behind it, but a WWE spokesman told Gizmodo that they were not involved. Steve Wilton, the business manager for Ultimate Creations, also denied involvement. A spokesman for the Internet Archive, though, told Gizmodo that the archive was removed because of a DMCA takedown request from the company’s business manager (Wilton’s job for years) on October 29, 2017, two days after the Vice article was published. (He has not replied to a follow-up email about the takedown request.)

Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.

Archive sites are trusted to show the digital trail and origin of content, which makes them not just a must-use tool for journalists but useful for just about anyone trying to track down vanishing web pages. With that in mind, the fact that the Internet Archive doesn’t really fight takedown requests becomes a problem. And takedown requests aren’t the only recourse: When a site admin elects to block the Wayback crawler using a robots.txt file, the crawling doesn’t just stop. Instead, the Wayback Machine’s entire history of a given site is removed from public view.
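For readers who have never looked inside a robots.txt file, here is a rough sketch of how such a block works, using Python’s standard urllib.robotparser module. The two-line file, the example.com URL, and the “SomeOtherBot” name are made up for illustration, and “ia_archiver” is assumed here to be the user-agent name the Wayback Machine has historically honored. The code only shows the ordinary crawl check; hiding already-archived pages is the Archive’s own policy on top of it, not something the protocol does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# "ia_archiver" is assumed to be the user-agent token the Wayback
# Machine has historically honored; it is not taken from the article.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL on a placeholder domain.
PAGE = "https://example.com/blog/old-post"

archiver_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)

print("ia_archiver allowed:", archiver_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

In this sketch the archiving crawler is shut out of every path on the site while any other crawler is still let through, which is all the protocol itself says. Nothing in the file asks for old snapshots to disappear.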

In other words, if you deal in a certain bottom-dwelling brand of controversial content and want to avoid accountability, there are at least two different, standardized ways of erasing it from the most reliable third-party web archive on the public internet.

For the Internet Archive, the robots.txt strategy, like quickly complying with takedown notices that challenge its seemingly fair use archive copies of old websites, does little more in practice than mitigate its risk while going against the spirit of the protocol. And if someone were to sue over non-compliance with a DMCA takedown request, even with a ready-made, valid defense in the Archive’s pocket, copyright litigation is still incredibly expensive. It doesn’t matter that the use is not really a violation by any metric. If a rightsholder makes the effort, you still have to defend the lawsuit.

“The fair use defense in this context has never been litigated,” noted Annemarie Bridy, a law professor at the University of Idaho and an Affiliate Scholar at the Center for Internet and Society at Stanford Law School. “Internet Archive is a non-profit, so the exposure to statutory damages that they face is huge, and the risk that they run is pretty great … given the scope of what they do; that they’re basically archiving everything that is on the public web, their exposure is phenomenal. So you can understand why their impulse might be to act cautiously even if that creates serious tension with their core mission, which is to create an accurate historical archive of everything that has been there and to prevent people from wiping out evidence of their history.”

While the Internet Archive did not respond to specific questions about its robots.txt policy, its proactive response to takedown requests, or whether it has tested any potential fair use defenses in court, a spokesperson did send this statement along:

Several months after the Wayback Machine was launched in late 2001, we participated with a group of outside archivists, librarians, and attorneys in the drafting of a set of recommendations for managing removal requests (the Oakland Archive Policy) that the Internet Archive more or less adopted as guidelines over the first decade or so of the Wayback Machine.

Earlier this year, we convened with a similar group to review those guidelines and explore the potential value of an updated version. We are still pondering many issues and hope that before too long we might be able to present some updated information on our site to better help the public understand how we approach take down requests. You can find some of our thoughts about robots.txt at http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/.

At the end of the day, we strive to strike a balance between the concerns that site owners and rights holders sometimes bring to us with the broader public interest in free access for everyone to a history of the Internet that is as comprehensive as possible.

All of that said, the Internet Archive has always held itself out to be a library; in theory, shouldn’t that matter?

“Under current copyright law, although there are special provisions that give certain rights to libraries, there is no definition of a library,” explained Brandon Butler, the Director of Information Policy for the University of Virginia Library. “And that’s a thing that rights holders have always fretted over, and they’ve always fretted over entities like the Internet Archive, which aren’t 200-year-old public libraries, or university-affiliated libraries. They often raise up a stand that there will be faux libraries, that they’d call themselves libraries but it’s really just a haven for piracy. That specter of the sort of sham library really hasn’t arisen.” The lone exception that Butler could think of was when American Buddha, a non-profit, online library of Buddhist texts, found itself sued by Penguin over a few items that they asserted copyright over. “The court didn’t really care that this place called itself a library; it didn’t really shield them from any infringement allegations.” That said, as Butler notes, while being a library wouldn’t necessarily protect the Internet Archive as much as it could, “the right to make copies for preservation,” as Butler puts it, is definitely a point in their favor.

Besides, “libraries typically don’t get sued; it’s bad PR,” Butler says. So it’s not like there’s a ton of modern legal precedent about libraries in the digital age, barring some outliers like the various Google Books cases.

As Bridy notes, in the United States, copyright is “a commercial right.” It’s not about reputational harm, it’s about protecting the value of a work and, more specifically, the ability to continuously make money off of it. “The reason we give it is we want artists and creative people to have an incentive to publish and market their work,” she said. “Using copyright as a way of trying to control privacy or reputation … it can be used that way, but you might argue that’s copyright misuse, you might argue it falls outside of the ambit of why we have copyright.”

We take a lot of things for granted, especially as we rely on technology more and more. “The internet is forever” may be a common refrain in the media, and the underlying wisdom about being careful may be sound, but it is also not something that should be taken literally. People delete posts. Websites and entire platforms disappear for business and other reasons. Rich, famous, and powerful bad actors have no qualms about intimidating small non-profit organizations. It’s nice to have safeguards, but there are limits to permanence on the internet, and where there are limits, there are loopholes.
