How AI systems trained on AI-generated data lead to erosion and collapse

In late July 2024, Nature published a paper led by Ilia Shumailov and Zakhar Shumaylov that rattled the AI world. The Oxford and Cambridge computer scientists showed that training successive generations of an AI model on AI-generated data (known as synthetic data) caused the model to degrade to the point of collapse.

This morning, New York Times reporter Aatish Bhatia published a remarkable follow-up to that work. Bhatia asked a simple question: What would that erosion look like, exactly?

Using a set of simple handwritten numerals, Bhatia repeatedly fed an AI system its own output. He started with this data:

After training the AI model on 20 generations of its own output (i.e., synthetic data used again and again), the data looked like this:

After training the AI model on 30 generations of synthetic data, the output had eroded into unintelligible blurs:

Bhatia writes:

While this is a simplified example, it illustrates a problem on the horizon.

Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction.

Just as a copy of a copy can drift away from the original, when generative A.I. is trained on its own content, its output can also drift away from reality, growing further apart from the original data that it was intended to imitate.
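For readers who want to see the mechanism in miniature, here is a toy sketch in Python. It is not Bhatia's code or the Nature team's experiment: a simple one-dimensional statistical model (a Gaussian) stands in for the handwriting model, and each generation is trained only on synthetic samples produced by the generation before it. The spread of the data typically drifts and shrinks, a numerical analogue of the blurring digits.

```python
# Toy illustration of recursive training on synthetic data (model collapse).
# Assumptions: a Gaussian stands in for the generative model; "training" is
# just estimating its mean and standard deviation from the current data.
import numpy as np

rng = np.random.default_rng(0)
SAMPLE_SIZE = 20      # small training sets make the drift easy to see
GENERATIONS = 30      # mirrors the 30 generations in the experiment

# Generation 0: "organic" data with mean 0 and standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=SAMPLE_SIZE)

for generation in range(1, GENERATIONS + 1):
    # "Train" the model: estimate its parameters from the current data.
    mu, sigma = data.mean(), data.std()
    # Generate synthetic data and use it as the next training set;
    # this recursive step is what drives the collapse.
    data = rng.normal(loc=mu, scale=sigma, size=SAMPLE_SIZE)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```

Even this tiny model tends to wander away from the original distribution within a few dozen generations. Large generative models are vastly more complex, but the feedback loop Bhatia describes is the same.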

Aatish Bhatia’s simple, brilliant journalistic experiment illustrates both the value of high-quality organic data and the risks inherent in AI systems trained on data of unknown provenance.

The blurring of random numerals may not seem like a big deal. But when that same erosion happens within AI systems making medical or financial decisions, people can be profoundly harmed.

At the Transparency Coalition we're working to enact policies that daylight basic information about the quality of the data on which AI models are trained. Such transparency not only protects personal data privacy and intellectual property; it also helps ensure that AI models are trained on high-quality data and operate properly, for the greater good of society.
