large language model: Difference between revisions

From IndieWeb
m ([tantek] added "Criticism: can fail basic context-dependence, like examples vs commands: 2024-06-06 [https://simonwillison.net/2024/Jun/6/accidental-prompt-injection/ Accidental prompt injection against RAG applications]" to "See Also")
(incorporate SAs into new sections)
=== Why disclaim ===
With more and more LLM-generated content being published to the web, e.g. by news outlets like Hoodline, people will increasingly look for actual human-written content instead, and disclaiming your use of LLMs will appeal to a larger and larger audience.


== IndieWeb Examples ==
* in definitions: [[definition#Do_not_copy_from_LLM_generated_text]]
* or content in general: [[wikifying#Do_not_copy_from_LLM_generated_text]]


== Other Examples ==
* IETF example: https://www.rfc-editor.org/rfc/rfc9518.html#appendix-A-4 <blockquote>No large language models were used in the production of this document.</blockquote>
* https://notbyai.fyi/


== See Also ==
* https://en.wikipedia.org/wiki/Large_language_model
* [[ai;dr]]
* Criticism: used to waste developer time with fake security bounty bug reports: (from the maintainer of [[curl]]) 2024-01-02 [https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/ The I in LLM stands for intelligence] <blockquote>Like for the email spammers, the cost of this ends up in the receiving end. The ease of use and wide access to powerful LLMs is just too tempting. I strongly suspect we will get more LLM generated rubbish in our Hackerone inboxes going forward.</blockquote>
* 2024-01-08 The Guardian UK: [https://www.theguardian.com/technology/2024/jan/08/ai-tools-chatgpt-copyrighted-material-openai ‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says] <br/>So don't even think about using ChatGPT to contribute material to the wiki, because you don't have the ability to know you can contribute it to the public domain / CC0.
* 2024-03-11 {{lmorchard}}: [https://blog.lmorchard.com/2024/03/11/dance-for-the-bots/ Dance like the bots aren't watching?] / TL;DR: Why bother sharing anything on the open web if it's just going to be fodder for extractive, non-reciprocal bots?
* 2023-03-13 {{t}}: [https://tantek.com/2023/072/t1/blog-as-if-ai-trained-posts Blog as if there’s an #AI being trained to be you based on your blog posts.]
* https://github.com/ai-robots-txt/ai.robots.txt
* Criticism / harms of companies selling / allowing user content to be used for training LLMs: https://t.co/1rmFNq7spR short link to 2024-04-06 NYT: [https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html How Tech Giants Cut Corners to Harvest Data for A.I.] / OpenAI, Google and Meta ignored corporate policies, altered their own rules and discussed skirting copyright law as they sought online information to train their newest artificial intelligence systems.
* ^ also tweeted: https://twitter.com/heyitsseher/status/1776823362292969880 <blockquote>Google quietly changed policies to scrape public Google Doc data to train AI.<br /><br />Then purposely released new terms of service on "Fourth of July weekend, when people were typically focused on the holiday"<br /><br />How Tech Giants Cut Corners to Harvest Data for A.I. https://t.co/1rmFNq7spR</blockquote>
* Criticism: 2024-04-30 NYTimes: [https://www.nytimes.com/2024/04/30/business/media/newspapers-sued-microsoft-openai.html 8 Daily Newspapers Sue OpenAI and Microsoft Over A.I.] / The suit, which accuses the tech companies of copyright infringement, adds to the fight over the online data used to power artificial intelligence.
* Criticism: 2024-04-22 The Atlantic: [https://www.theatlantic.com/technology/archive/2024/04/generative-ai-search-llmo/678154/ It’s the End of the Web as We Know It] <blockquote>SEO will morph into LLMO: large-language-model optimization, the incipient industry of manipulating AI-generated material to serve clients’ interests. <br/>[...]<br/>LLMs aren’t people we connect with. Eventually, people may stop writing, stop filming, stop composing—at least for the open, public web.</blockquote>
* How to block LLM scrapers in [[11ty]]: 2024-04-15 [https://evilgeniuschronicles.org/posts/2024/04/15/blockin-bots-with-eleventy/ Blockin' Bots with Eleventy]
* Criticism: can fail basic context-dependence, like examples vs commands: 2024-06-06 [https://simonwillison.net/2024/Jun/6/accidental-prompt-injection/ Accidental prompt injection against RAG applications]

Revision as of 21:32, 5 July 2024


A large language model (AKA LLM) is usually a reference to a service, like OpenAI’s ChatGPT, that synthesizes text based on a massive set of prose typically crawled and indexed from the open web and other sources. LLMs should not be used to contribute content to the IndieWeb wiki; several IndieWeb sites disclaim any use thereof for their content.

== Why ==

=== Why disclaim ===

With more and more LLM-generated content being published to the web, e.g. by news outlets like Hoodline, people will increasingly look for actual human-written content instead, and disclaiming your use of LLMs will appeal to a larger and larger audience.

== How to ==

=== How to block ===
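One common approach, reflected in projects like ai.robots.txt, is to ask LLM training crawlers not to fetch your site via robots.txt. A minimal sketch (the user-agent names below are the ones these crawlers publicly document; consult the ai.robots.txt list for a maintained, complete set):

```text
# robots.txt — ask known LLM training crawlers to stay away.
# Note: robots.txt is advisory; badly behaved crawlers may ignore it.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Because compliance is voluntary, some sites additionally block these user agents at the web server or CDN level.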

== IndieWeb Examples ==

In rough date order of adding to personal sites:

=== Hidde ===

Hidde de Vries added a ‘no LLMs involved’ note to the site-wide footer of hidde.blog on 17 March 2023:

No language models were involved in writing the blog posts on here.

=== capjamesg ===

capjamesg, since at least 2023-07-14(?), has displayed a "Not By AI" image button on all his blog posts, going back to at least 2020.

=== Aaron Parecki ===

Aaron Parecki put a "Not by AI" badge in his global website footer (e.g. bottom of https://aaronparecki.com/) next to the IndieWeb/Microformats/Webmention buttons on 2023-10-15.

=== Tantek ===

Tantek Çelik put a general disclaimer about no use of LLMs for his site on his homepage on 2023-12-31:

=== Paul Watson ===

Paul Watson put a general disclaimer about no use of LLMs for his site on his blog homepage [1] on 2024-01-04:

No large language models (LLM) or similar AI technologies were involved in writing the blog posts on here.

and also a similar statement and button (by https://notbyai.fyi/) on the footer of all pages on 2024-01-05.

=== Todd Presta ===

to2ds updated the existing blurb above the fold on the homepage to use the term "LLMs" instead of the generic "AI" [2] on 2024-01-05, and also added "Not By AI" written and painted badges to the common footer along with the IndieWeb badge and link.

=== Add yourself ===

Add yourself here… (see this for more details)
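The disclaimers in the examples above are typically a line or two of plain HTML in a site-wide footer. A minimal sketch (the wording, class name, and badge link are illustrative, not any particular site's markup):

```html
<!-- Site-wide footer disclaimer; wording and class name are illustrative -->
<footer class="site-footer">
  <p>No large language models (LLMs) were used to write the content on this site.</p>
  <p><a href="https://notbyai.fyi/">Not by AI</a></p>
</footer>
```

Placing the note in a shared footer template means it appears on every page without per-post effort, which is what most of the examples above do.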

== IndieWeb Wiki Examples ==

Do not use LLMs for content for the wiki:

Don't even think about using ChatGPT to contribute material to the wiki, because you don't have the ability to know you can contribute it to the public domain / CC0.

== Other Examples ==

== IndieWeb opinions ==

== Criticism ==

=== Encourages disregarding copyright ===

=== Encourages training on private data ===

=== Discourages open creativity and sharing ===

* 2024-04-22 The Atlantic: [https://www.theatlantic.com/technology/archive/2024/04/generative-ai-search-llmo/678154/ It’s the End of the Web as We Know It] <blockquote>SEO will morph into LLMO: large-language-model optimization, the incipient industry of manipulating AI-generated material to serve clients’ interests. <br/>[...]<br/>LLMs aren’t people we connect with. Eventually, people may stop writing, stop filming, stop composing—at least for the open, public web.</blockquote>

=== Abused to waste developer time ===

Criticism: LLMs used to waste developer time with fake security bounty bug reports:
* (from the maintainer of [[curl]]) 2024-01-02 [https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/ The I in LLM stands for intelligence] <blockquote>Like for the email spammers, the cost of this ends up in the receiving end. The ease of use and wide access to powerful LLMs is just too tempting. I strongly suspect we will get more LLM generated rubbish in our Hackerone inboxes going forward.</blockquote>

=== Fails basic context dependence ===

Can fail basic context-dependence, like examples vs commands:
* 2024-06-06 [https://simonwillison.net/2024/Jun/6/accidental-prompt-injection/ Accidental prompt injection against RAG applications]

=== Scraping for training widely rejected ===

85% of Cloudflare's customers prefer to block GenAI training scrapers:

== See Also ==