Input Safeguards: Require Transparency in AI Training Data

There is a common saying in data science: “Garbage in, garbage out.”

When it comes to artificial intelligence, data is everything. Whether an AI system’s outputs are useful and fair depends entirely on the data used to train it. Currently, AI developers are not required to disclose any information about the data used to train their AI systems.

That gives consumers zero information about the quality of the data used to create the system. It also allows developers to hide any potential unauthorized use of copyrighted works, private information, hate speech, disinformation, or other questionable data.

As a result, customers of AI systems cannot make informed decisions when purchasing AI products to run their businesses, or when exchanging sensitive personal information for services that promise to improve their quality of life.

At the Transparency Coalition, we believe transparency in AI training data is the foundation of ethical AI. State legislatures should consider measures that require developers of AI systems and services to publicly disclose specified information related to the datasets used to train their products.

California’s AB 2013, signed into law in September 2024, is the first such measure adopted in the United States. In early 2025, more states are expected to consider similar bills, which would establish a common regulatory requirement and an appropriate industry standard.
