
AI has poisoned its own well

Replied to The Curse of Recursion: Training on Generated Data Makes Models Forget (arXiv.org)

What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

I suspect tech companies (particularly Microsoft / OpenAI and Google) have miscalculated, and in their fear of being left behind, have released their generative AI models too early and too widely. By doing so, they’ve essentially capped how much their products can improve, because of the threat of model collapse. I don’t think the quality that generative AI can reach on a poisoned data supply will be good enough to get rid of all us plebs.
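To make the mechanism concrete, here’s a toy sketch of my own (not from the paper): pretend each generation’s “model” is just the empirical distribution of its training set, and the next generation trains only on samples drawn from the previous one. Rare tokens that miss the cut in any finite sample are gone for good, which is the tail-thinning the researchers describe, in miniature. All the numbers are made up for illustration.

```python
# Toy illustration of model collapse via recursive training on generated data.
# Each generation's "model" is simply the empirical distribution of its
# training set; the next generation trains only on samples drawn from it.
# Tokens that fail to appear in a finite sample get probability zero and can
# never reappear, so the tail of the original distribution steadily erodes.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sample_size, generations = 1_000, 10_000, 20

# Generation 0: "human" data drawn from a long-tailed (Zipf-like) token distribution.
probs = (1.0 / np.arange(1, vocab_size + 1)) ** 1.1
probs /= probs.sum()
data = rng.choice(vocab_size, size=sample_size, p=probs)

for gen in range(1, generations + 1):
    counts = np.bincount(data, minlength=vocab_size)
    model = counts / counts.sum()                              # "train" on the current data
    data = rng.choice(vocab_size, size=sample_size, p=model)   # next gen sees only model output
    distinct = np.count_nonzero(np.bincount(data, minlength=vocab_size))
    print(f"generation {gen:2d}: distinct tokens surviving = {distinct}")
```

The count of surviving tokens only ever goes down: once a rare token drops out of a generation’s sample, no later model can produce it again, no matter how much synthetic text it churns out.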

They need an astronomical amount of training data to make any model better than what already exists. But by releasing their models for public use now, while they’re still not very good, they’ve let too many people pump the internet full of mediocre generated content with no indication of provenance. Stack Overflow has thrown up its hands and said it can’t moderate generative AI content, meaning the site can no longer serve as a training source for coding material. Publishers of formerly reputable sites are laying off their staff and experimenting with AI-generated articles. There is no consistent system for marking up generated content online that would let companies trust material of unknown origin as training data. Because of this approach, 2022 and 2023 will be essentially “lost years” of internet-sourced content, even if companies can establish a tagging system going forward and persuade people hostile or ambivalent toward them to use it.
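For what it’s worth, the kind of “tagging system” I mean would have to be machine-readable at crawl time, something like the sketch below. The content-provenance field and its values are entirely hypothetical; nothing like it is consistently in use today, which is exactly the problem.

```python
# Hypothetical sketch only: what a provenance-aware crawl filter could look like
# if generated content were consistently labeled. The "content-provenance" field
# and its values are invented for illustration; no such standard is widely used.
from dataclasses import dataclass, field

@dataclass
class ScrapedDoc:
    url: str
    text: str
    meta: dict = field(default_factory=dict)  # parsed page metadata or headers

def usable_for_training(doc: ScrapedDoc) -> bool:
    """Keep a document only if it explicitly claims human authorship."""
    return doc.meta.get("content-provenance") == "human-authored"

docs = [
    ScrapedDoc("https://example.com/essay", "…", {"content-provenance": "human-authored"}),
    ScrapedDoc("https://example.com/listicle", "…", {"content-provenance": "ai-generated"}),
    ScrapedDoc("https://example.com/mystery", "…"),  # unlabeled: origin unknown, untrusted
]
training_set = [d for d in docs if usable_for_training(d)]
print(len(training_set))  # only the explicitly human-authored page survives
```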

In their haste to propagandize the benefits of generative AI and drive adoption so widespread it can’t be stopped, these companies are already encouraging people to lean on LLMs to write for them. Microsoft is plugging AI tools into its flagship Office suite. Writing well is a skill that few possess today, and they’re creating an environment where even fewer people will bother to learn and practice professional writing. As time goes by, there will be less human-created material (especially of quality and complexity) available as new training data.

Obtaining quality training data is going to be very expensive in five years if AI companies don’t win all their lawsuits claiming that training on copyrighted work is fair use. By allowing us a glimpse into their vision and process, they’ve made nearly every professional artist and writer see them as an existential threat. Even if they do win their fair use lawsuits, it may be a challenge to access the data; every creative person who relies on their work for pay will do everything they can to prevent their creations from becoming future training data.

Even worse, these companies’ misuse of the internet commons — of humanity’s collective creativity — as fuel for their own profit could lead to fragmentation and closing off of online information to prevent its theft. Bloggers don’t want their words stolen, and social media companies are getting wise to the value of “their” data and beginning to charge for API access. The difficulty and cost of gathering sufficient high quality training data for future models will incentivize continued use of whatever is easiest to grab, only hastening model collapse and increasing the likelihood of malicious actors perpetrating poisoning attacks.


A discussion on Hacker News

By Tracy Durnell

Writer and designer in the Seattle area. Reach me at tracy.durnell@gmail.com. She/her.

10 replies on “AI has poisoned its own well”

What do I want the future of the Internet to look like? Last updated 2024 May 19 | More of my big questions Sub-questions What do I want out of the Internet? What’s a better way to use the Internet? How can I support the independent web? What are the social norms around blogging and…

We need our regulators and legislators working on this now. Provide protection for content creators so that AI cannot train without conforming to some rule. Creative Commons contemplated this back in 2021.

I think this is a fascinating and interesting take, and it echoes what I have been thinking (so thank you for writing something way more eloquently than I ever could!).

[…] AI has poisoned its own well – Tracy Durnell. “When an AI consumes data it hallucinates, it gets dumb, fast. AI has a malnutrition problem, which is a major challenge for the growth of generative AI. There aren’t good solutions (sidenote: I’m actually working on a proposal for one; more details soon, hopefully.) But it’s an issue Tracy Durnell explains perfectly in this blog post.” (Alistair for Hugh). […]
