Generative models like ChatGPT and Stable Diffusion are making it much easier for everyone to create high-quality content. But what happens when the internet becomes flooded with AI-generated content?
According to a study by researchers at the universities of Oxford, Cambridge, Toronto, and Edinburgh and at Imperial College London, machine learning models trained on content produced by generative AI develop irreversible defects that compound from one generation to the next.
Key findings:
- Generative models approximate the distribution of the data used to train them.
- When a generative model’s output is used as training data for its successor, the distribution of that data is slightly different from the original.
- When this process is repeated across multiple generations of models, the small deviations compound and each successive model’s behavior drifts further in erroneous ways.
- Eventually, this causes “model collapse,” a degenerative process in which models gradually forget the true underlying data distribution.
- Model collapse can happen to all kinds of ML models, including LLMs, diffusion models, and variational autoencoders (VAEs).
- The only way to prevent model collapse is to preserve access to the original, human-generated data over time (the sketch after this list illustrates both the problem and this remedy).
- As the web becomes flooded with AI-generated content, access to human-generated data will become a new front of competition between tech companies.
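To make the feedback loop behind these findings concrete, here is a toy simulation (my own illustration under simplifying assumptions, not code from the study): the “generative model” is reduced to a one-dimensional Gaussian fitted by maximum likelihood, and each generation trains only on samples drawn from its predecessor. The function name final_std, the human_fraction parameter, and the sample sizes are hypothetical conveniences.

```python
# Toy model-collapse simulation: a Gaussian stands in for the generative
# model, and each generation is fitted to data sampled from the previous
# one. Small estimation errors compound across generations.
import numpy as np

rng = np.random.default_rng(0)
N = 100                                 # training samples per generation
original = rng.normal(0.0, 1.0, N)      # the "human" data: N(0, 1)

def final_std(generations: int, human_fraction: float = 0.0) -> float:
    """Recursively train on generated data; return the last generation's std."""
    data = original
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()   # "train" the model (MLE fit)
        synthetic = rng.normal(mu, sigma, N)  # the model generates new data
        k = int(human_fraction * N)           # optionally preserve real data
        data = np.concatenate([original[:k], synthetic[: N - k]])
    return data.std()

print(f"std after 1000 generations, synthetic only: {final_std(1000):.2e}")
print(f"std after 1000 generations, 20% human data: {final_std(1000, 0.2):.2f}")
```

In the synthetic-only chain, the standard deviation shrivels toward zero: the low-probability tails of the original distribution are forgotten first, which is the degenerative drift described above. The chain that mixes even 20 percent of the original human data into every generation holds its shape, matching the intuition that preserving access to original data is what prevents collapse.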
Read the full article on TechTalks.
"Garbage In, Garbage Out" comes to mind, but... one computer's trash is another computer's treasure. I wonder if we are watching human language evolve into "chunks of idealized text that represent complex ideas the best way they can." Instead of words, we'll have chunks. I'm not sure if this is good or bad, but it is very, very different.