Synthetic Data and AI ‘Model Collapse’
Just as a photocopy of a photocopy can drift away from the original, a generative AI model trained on its own synthetic data can drift away from reality, growing further and further from the organic data it was meant to imitate.
This is why large organic datasets are so coveted by AI model developers. Synthetic data is like plastic fruit: It looks just as good as the juicy carbon-based original, but it's hollow and inert. No taste. No texture. No spark of life.
The limitations of synthetic data were brought to light in July 2024, when Nature published a paper led by Ilia Shumailov and Zakhar Shumaylov that rattled the AI world. The researchers showed that feeding AI-generated data (a.k.a. synthetic data) back into an AI model caused subsequent generations of that model to degrade to the point of collapse.
A few weeks later, New York Times reporter Aatish Bhatia published a remarkable follow-up to that work. Bhatia asked a simple question: What would that erosion look like, exactly?
this is what synthetic data erosion looks like
Using simple hand-written numerals, Bhatia fed an AI system its own output. He started with this data:
After training the AI model on 20 generations of its own output (i.e., synthetic data used again and again), the data looked like this:
After training the AI model on 30 generations of synthetic data, the output had eroded into unintelligible blurs:
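The mechanics of the feedback loop are easy to sketch. The snippet below is a simplified illustration, not a reconstruction of Bhatia's experiment: his test used a neural generative model trained on images of handwritten digits, while this sketch substitutes a Gaussian mixture model fitted to scikit-learn's small built-in digits dataset. What it shows is the loop itself: each generation of the model is trained only on samples produced by the generation before it, with no fresh organic data ever mixed back in.

```python
# A toy sketch of the feedback loop behind model collapse -- not Bhatia's
# actual setup. His experiment used a neural generative model trained on
# handwritten digits; here a Gaussian mixture model fitted to scikit-learn's
# built-in 8x8 digits dataset stands in for the generator. The point is the
# loop: fit a model, sample synthetic data from it, then train the next
# generation only on those samples, with no organic data ever added back.
from sklearn.datasets import load_digits
from sklearn.mixture import GaussianMixture

# Generation 0: organic data (1,797 real handwritten digits, 8x8 pixels each).
data = load_digits().data

for generation in range(1, 31):
    # Fit a fresh generative model to whatever the previous generation produced.
    model = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
    model.fit(data)

    # Replace the training set with samples drawn from the model itself.
    data, _ = model.sample(n_samples=len(data))

    # A crude diversity proxy: as samples grow more homogeneous, the average
    # per-pixel spread tends to shrink across generations.
    print(f"generation {generation:2d}  mean pixel std = {data.std(axis=0).mean():.3f}")
```

The design choice that drives the erosion is the full replacement of the training data each round: because no organic data re-enters the loop, sampling and estimation errors compound from one generation to the next instead of being corrected by contact with reality.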
Bhatia wrote:
“While this is a simplified example, it illustrates a problem on the horizon.
Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction.”
this is why training data quality is important
Aatish Bhatia’s simple, brilliant journalistic experiment illustrates both the value of high-quality organic data and the risks inherent in AI systems trained on data of unknown provenance.
The blurring of random numerals may not seem like a big deal. But when that same erosion happens within AI systems making medical or financial decisions, people’s lives may be profoundly harmed.
At the Transparency Coalition we’re working to enact policies that daylight basic information about the quality of the data on which AI models are trained. This not only protects personal data privacy and intellectual property; it also helps ensure that AI models are trained on high-quality data and operate properly, for the greater good of society.