How to disagree with google’s privacy policy

Yesterday I read a toot about google’s new privacy policy: google reserves the right to use any public content to train their AIs. The crazy thing about this change in their privacy policy is, of course, that it somehow gives them permission to do so, even if you never use any of their services. Simply because my website exists, they think they have the right to use its content.

I have been looking for ways to not allow companies to use my stuff without asking, and so far I couldn’t find any. But since this policy change I realised that there is a simple one: block google’s bots from visiting your website. There are a few ways to do this, but by far the simplest is to add a file with the name robots.txt to the root of your website. In it you add these lines:

User-agent: Googlebot
Disallow: /

This tells google’s bots that they are not welcome, and I guess it’s the only way to tell them that you disagree with their privacy policy and reject it.
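If you want to check that these two lines do what you expect, here is a minimal sketch using Python’s standard urllib.robotparser module. The function name is_allowed and the example.com URL are placeholders I made up for illustration. It only shows how a well-behaved parser reads the rules, not what google’s bots actually do.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the robots.txt above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder, not a real site of mine.
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/post"))
```

Crawlers that are not named in the file are still allowed, which is why other bots need their own entries.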

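Weirdly enough this robots.txt doesn’t remove pages from their search results. It only stops the bots from crawling, and a bot that isn’t allowed to crawl a page never gets to see a noindex instruction on it. So if you want all your pages out of their index then you must first add this meta tag to the head of every page on your website: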

<meta name="googlebot" content="noindex">

This is a lot of work if you have many different HTML pages on your site. There’s a way to do this on the server level as well. If you use Nginx then you can add this header to the configuration. It should work in a similar way with other web servers.

server {
    add_header X-Robots-Tag "googlebot: noindex, noarchive" always;
}
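To see which directives such a header value addresses to a given bot, here is a rough sketch in Python. It only parses header strings you hand it and fetches nothing. The names directives_for and KEYED are hypothetical, and the list of directives that carry their own colon is my assumption, so treat it as a simplification that handles the simple forms only.

```python
# Directives that contain a colon themselves (assumed list, may be incomplete).
KEYED = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def directives_for(header_values, bot):
    """Collect the directives in X-Robots-Tag values that apply to bot."""
    found = set()
    for value in header_values:
        name, sep, rest = value.partition(":")
        name = name.strip().lower()
        if sep and "," not in name and name not in KEYED:
            # Form "googlebot: noindex, noarchive": only for the named bot.
            if name != bot.lower():
                continue
            value = rest
        # Anything left is a comma-separated list of directives.
        found.update(d.strip().lower() for d in value.split(",") if d.strip())
    return found


if __name__ == "__main__":
    header = ["googlebot: noindex, noarchive"]
    print(sorted(directives_for(header, "googlebot")))
    print(sorted(directives_for(header, "bingbot")))
```

With the Nginx header above, only a crawler that identifies as googlebot is told not to index or archive the page. Other crawlers get no instruction from it.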

It can take quite some time, even months, before all pages disappear from google’s index. They disappear one by one whenever a google bot visits an indexed page again. When they’re all gone you can safely add the robots.txt.

But what about SEO?

Many people care about their search ranking on google. I don’t. Google’s search results are pretty bad to begin with. There’s no clear distinction between results based on content and paid results, which makes them completely untrustworthy. You should never use their search engine (as you should probably never use any of their services). There are better alternatives, like DuckDuckGo and Kagi.

And what about OpenAI and other companies?

As far as I know there’s no way to stop them from stealing your content. If you find a way, please let me know.

Update

You can block the ChatGPT bot from crawling your site by adding these lines to your robots.txt:

User-agent: GPTBot
Disallow: /

Update

You can speed this de-google process up a little, but then you need to use a URL removal tool that google offers. The problem is, of course, that you explicitly have to agree with google’s privacy policy when you use that tool…

Comments

  1. While this may work today, disallowing Googlebot via robots.txt doesn’t necessarily block Google from harvesting your data (or indeed using data it has previously harvested/added to the index). If Google wants to, it could simply disregard your robots.txt entry. Currently Google has an incentive to respect your choice (communicated through the use of the robots tag/robots.txt), but if they become intent on subsuming the web’s content and the information therein, to be served directly as if it’s their own, this incentive is lost and it seems entirely possible they’ll simply harvest all the content they can.

    A (slightly) more robust solution would be to block requests made by the Googlebot useragent, but this again is tenuous as it would be easy for a crawler to spoof its useragent and pass as public traffic.

    A more robust solution would be needed, something along the lines of bot management involving pattern recognition (potentially AI powered).

    • Vasilis

    @Sam That’s all true, of course. I’ve been thinking about blocking all known IP addresses of google bots on a server level. There is an official list that google updates every day. I might try that some day.

    For now I think that using robots.txt is the best way to actively tell google that you disagree with their “privacy” policy, at least on a conceptual level. In practice google will of course keep on doing whatever the fuck it wants to do.
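The idea from the comments, blocking the published IP addresses of google’s bots, could be sketched in Python like this. I am assuming the list is a JSON document with a "prefixes" array whose entries have an "ipv4Prefix" or "ipv6Prefix" key. The sample data below uses documentation-only address ranges, not google’s real ones, and the function names are made up for illustration.

```python
import ipaddress
import json

# Sample in the shape the published list is ASSUMED to have.
# The prefixes are reserved documentation ranges, not real bot addresses.
SAMPLE_LIST = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(raw_json):
    """Turn the JSON list into ip_network objects (hypothetical helper)."""
    data = json.loads(raw_json)
    networks = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def is_listed(address, networks):
    """Return True if address falls inside any of the listed networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_LIST)
    for addr in ("192.0.2.17", "198.51.100.1", "2001:db8::1"):
        print(addr, "blocked" if is_listed(addr, nets) else "allowed")
```

In practice you would refresh the list regularly and turn it into deny rules for your web server or firewall, since checking every request in a script like this is only meant to show the matching logic.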