Transparency and Synthetic Data
The use of synthetic data isn’t inherently good or bad. In medical research, for example, it’s a critically important tool that allows scientists to make new discoveries while protecting the privacy of individual patients.
At the Transparency Coalition, we are not calling for limits on the creation or use of synthetic data. What is needed is disclosure: developers should disclose when synthetic data is used to train an AI model.
When an AI developer's potential partners, clients, and end users know that synthetic data was used (disclosed as part of a Data Declaration), that knowledge provides a level of accountability and quality control that would otherwise be missing.
Why does it matter? Because quality matters: without basic knowledge of an AI system's training data, the users of that system have no way to gauge how much trust to place in it. And these are systems that may affect the survival of individuals and entire companies alike.
To learn more about the known limitations of synthetic data used to train AI models, see this 2024 research paper: Synthetic Data in AI: Challenges, Applications, and Ethical Implications.