
Pulling my site from Google over AI training

ADDED 4 October 2023:

Google has announced a new user agent token you can block to exclude your website from being used to train Bard and Vertex AI: Google-Extended. To block your site from being used to train Google’s AI products, you should include this in your robots.txt file:

# Google AI
User-agent: Google-Extended
Disallow: /

Because it’s a standalone token, we don’t need to block Google from indexing our websites in order to stop them from using our content to train their AI products.

⭐ ADDED 11 December 2023:

Except!!!! Google-Extended applies to their products but not their generative search results. So if you don’t want your content to appear in generative search results, you still need to block Googlebot.
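
In other words, opting out of both would mean something along these lines in robots.txt (a sketch, not a complete file; note that blocking Googlebot stops it crawling your whole site, and pages it has already indexed may still need a noindex to actually drop out of results):

# Google AI training
User-agent: Google-Extended
Disallow: /

# Google Search, including generative search results
User-agent: Googlebot
Disallow: /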

 

ORIGINAL ARTICLE (published 11 July 2023):

After thinking about it for a couple days, I’ve decided to de-index my website from Google. It’s reversible — I’m sure Google will happily reindex it if I let them — so I’m just going ahead and doing it for now. I’m not down with Google swallowing everything posted on the internet to train their generative AI models. I was pushed over the edge by posts from Jeremy Keith and Vasilis van Gemert, thanks y’all.

I don’t have Google Search Console set up for this website so I don’t know how much search traffic I get. My other blog, Cascadia Inspired, got about 200 hits in the past three months. I’m not going to cry over that — they’re mostly going to one 2015 article anyway (and probably not that helpful of a post, to my eye. Around New Year’s every year I usually get an influx of people to my ten-year-old guide to doing a creative annual review. Sorry folks, I’m sure someone else has written something better by now.) 😉

I’m going to start by pulling my websites out of Google search, then work on adding my sites to directories. Maybe I’ll even join a webring 💍✨

Adding a noindex meta tag to my WordPress header

Because my website has already been indexed by Google, I need to allow Googlebot to re-crawl the pages and see the new “noindex” instruction. So in the future I’ll also block the Googlebot crawler, but not just yet 😉

I added this code to the functions.php file of my child theme:

add_action( 'wp_head', function() {
    // Output a Googlebot-specific noindex meta tag in the <head> of every page.
    echo '<meta name="Googlebot" content="noindex, nofollow, noimageindex">';
} );

I figured out how to adapt this from WPExplorer. This random WordPress plugin help forum suggested another version; I don’t know which is better 🤷‍♀️

I’m not 100% sure whether the noimageindex directive is actually helpful for Googlebot, since that’s their text crawler, but it can’t hurt, right? (Tell me if it hurts lol.) Yoast says there’s a better way to block image indexing, but I’m scared of touching the .htaccess file and definitely won’t touch anything on my server 😂 (I’m on shared hosting anyway, so I think the edits I can make are limited?)
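
For what it’s worth, one alternative (maybe what that forum suggested, maybe not) is the wp_robots filter WordPress has had since 5.7. It controls the generic robots meta tag, so it applies to every crawler rather than just Googlebot. A minimal sketch:

add_filter( 'wp_robots', function( $robots ) {
    // Add directives to the <meta name="robots"> tag WordPress outputs in wp_head.
    // Unlike the Googlebot-specific tag above, this applies to all crawlers.
    $robots['noindex']      = true;
    $robots['nofollow']     = true;
    $robots['noimageindex'] = true;
    return $robots;
} );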

Blocking bots that collect training data for AIs (and more)

In addition, I created a robots.txt file to tell “law-abiding” bots what they’re not allowed to look at. I ought to have done this before but kind of assumed it came with my WordPress install 😅 (Nope.)

AI user agents to block

There are so many now; just copy from my robots.txt file tbh, or see the consolidated sketch after these updates.

ADDED 4 October 2023: To block training of Google’s Bard, I blocked Google-Extended.

I specifically want to deter my website from being used for training LLMs, so I blocked Common Crawl (their crawler is CCBot).

To block OpenAI, I blocked both user agents ChatGPT-User and GPTBot. (Added GPTBot 10 August 2023)

ADDED 4 October 2023: Per Neil Clarke’s article, I have also blocked Omgilibot, Omgili, and FacebookBot. (Via Jeremy Keith)

ADDED 14 February 2024: I also blocked user agents used in AI training sets: anthropic-ai, Bytespider, FacebookBot, and PerplexityBot (source)

ADDED 16 April 2024: Prompted by Ethan Marcotte, I blocked several more known and suspected user agents used in AI training: Claude-Web, ClaudeBot, cohere-ai, Diffbot, YouBot, ChatGPT.

ADDED 17 June 2024: I’ve now blocked Apple’s AI training bot Applebot-Extended (thanks for the heads-up, James!). Does anyone else feel like this is getting ridiculous?

I also blocked Amazonbot and Applebot to block Alexa’s and Siri’s “smart answers.” I believe this also excludes me from Apple search.

I’ve also now blocked Googlebot and bingbot in protest of their generative AI search results — I’ve had the code up for my pages to be deindexed by Google for over six months, and I’m done waiting.
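
For reference, a consolidated robots.txt covering everything mentioned above would look roughly like this (a sketch rather than my exact file; double-check the current token names before copying, since they keep changing, and note some parsers prefer one User-agent per block):

# AI training and AI assistant crawlers
User-agent: Google-Extended
User-agent: CCBot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ChatGPT
User-agent: Omgilibot
User-agent: Omgili
User-agent: FacebookBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Bytespider
User-agent: PerplexityBot
User-agent: Diffbot
User-agent: YouBot
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Applebot
Disallow: /

# Search engines with generative AI results
User-agent: Googlebot
User-agent: bingbot
Disallow: /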

Dark Visitors apparently has a WordPress plugin to update your robots.txt whenever a new agent comes out, but for now I’m stickin’ with manual. I am also still wary of modifying my .htaccess file and breaking something, so it’s just my robots.txt making my stance clear — I can’t control whether companies have any sort of ethics and comply, unfortunately.
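
If I ever get sick of editing by hand, I gather WordPress also has a robots_txt filter for the virtual robots.txt it serves when no physical file exists (a physical file overrides it, so this wouldn’t apply to my setup). A rough sketch, using a couple of the agents above as examples:

add_filter( 'robots_txt', function( $output, $public ) {
    // Append rules for AI training crawlers to WordPress's virtual robots.txt.
    // Ignored if a physical robots.txt file exists, since that takes precedence.
    $output .= "\n# AI training crawlers\n";
    $output .= "User-agent: GPTBot\nDisallow: /\n\n";
    $output .= "User-agent: Google-Extended\nDisallow: /\n";
    return $output;
}, 10, 2 );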

Other user agents

Searching on DuckDuckGo, I found an older article from a theme maker with specific advice for a WordPress robots.txt. From there I jumped to Jeff Starr’s recommendations from 2020.

I also appreciate fellow opinionated individuals on the internet so I followed some other blocks from Rohan Kumar. I would happily take more opinionated suggestions of junk bots to block if anyone else has opinions or can point me to a list somewhere 😉

Note: this article generated a lot of interest! See a Hacker News discussion.

 

Syndicated to IndieWeb News

By Tracy Durnell

Writer and designer in the Seattle area. Reach me at tracy.durnell@gmail.com. She/her.

24 replies on “Pulling my site from Google over AI training”


Tracy,

I can understand your decision, but based on my reading of your post, it seems like you don’t understand some key fundamental things. There is this concept of credibility, and when dealing with deceitful people, credibility is very important. If they aren’t credible, you can’t trust that they won’t do whatever is in their best interest. Google has said they are going to use anything on the open internet to train their AI regardless of the content owner’s stance, regardless of robots.txt or some indicator stating your wishes.

De-indexing, in my opinion, doesn’t do much since they’ve already said they won’t follow things like robots.txt (i.e. they have no credibility). The noindex meta tag is just another flag like robots.txt. You’d have to do a lot of work identifying their crawlers and serving those requests fake, useless data at your own cost, or get really creative and serve non-deterministic input to break the determinism of their code reading your site.

Hi Dundir, thanks for your concern.

I freely acknowledge the futility of the gesture. This post may not address it*, but I do recognize that Google has the power in this scenario. They are under no obligation to honor robots.txt or noindex instructions. They can and will, I’m sure, consume everything I publish regardless of anything I do short of making my site private. But, I am making clear that they are doing so without permission. Physical businesses can 86 someone; likewise, I can disallow their crawler from the website that I pay for. They are not invited here; they are breaking and entering with intent to steal. I simply don’t have enforcement power.

I know it doesn’t matter what my opinion of fair use is. Our laws were not designed with this kind of technology in mind, and it’s very possible corporations will win all their court cases over training data. Even if they do, I still don’t have to believe it is fair or right for anyone to steal my intellectual property to use it to create a competing product. We have many unjust laws that favor corporations over individuals.

All I can do is raise my hand and say, I do not consent. I don’t have to accept their theft without complaint — and because I’ve published my complaint online, it’s public and visible. I can bear witness and protest the ethics and legality of non-consensual data use. I will never win a technical battle against a corporation, but they can’t silence me when I have my own website. It is a double-edged sword: my writing is available to steal, but it’s also available to read. By de-indexing, I am declaring that I don’t need them — I am putting my trust in human curators over search. But they do need “me” (as in, people writing original content and publishing it online). Yes, I’m a silly idealist, but I’m not going to let certain failure stop me protesting injustice against myself. This is a hopeless righteous effort, but I make it out of pride for my work and its worth.

* Frankly, I didn’t anticipate many people seeing this post, and chiefly intended it as a reference for others with WordPress sites who might want to do the same. If I’d known it would hit Hacker News, I would have spelled out a lot more of this sentiment 😉
