How to Format and Require Data Disclosures
Developers of AI systems should be required to provide documentation for all training data used in the development of an AI model.
This type of auditable information set gives deployers, consumers, and regulators transparency and assurance. It is similar to the SOC 2 reports that are standard in the cybersecurity industry: issued by third-party auditors, SOC 2 reports assess how a service organization manages the risks associated with its software or technology services.
A Data Declaration is not necessarily tied to government oversight. Rather, we believe it should become a standard component of every AI model—expected and demanded by AI system deployers as a transparent mark of quality and legal assurance.
The Transparency Coalition’s Data Declaration template is available below. Feel free to download it and use it as you see fit.
| FIELD NAME | POSSIBLE VALUES |
|---|---|
| Data Set Name | Text |
| Data Set Owner | Text |
| Data Set Description | Text |
| Data Set Size | Numerical |
| Data Set Category | Web text, images, music, video, books |
| Data Set License Type | Commercial License, Proprietary, Public Domain, Fair Use claim |
| Data Set License Name | e.g. GPL, Apache, Creative Commons |
| Data Set Collection Period | Start date, End date (or Present) |
| Data Set Usage Period | Start date, End date (or Present) |
| Data Set contains personal or personally identifiable information | Yes or No |
| Personal Information Opt-in obtained | Yes or No |
| Personal Information License mechanism | EULA, Terms of Service, Privacy Policy, Click Through |
| Personal Information anonymized prior to training | Yes or No |
| Data Set contains Copyrighted Information | Yes or No |
| License governing Copyrighted Information | Fair use, Commercial license |
| Synthetic Training Data use | Yes or No |
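A declaration like the template above is most useful when it is machine-readable, so auditors and deployers can validate it automatically. The sketch below is a minimal illustration of one way to do that: a hypothetical `DataDeclaration` dataclass whose field names, types, and sample values are our own assumptions, not part of the Coalition's template, serialized to JSON.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DataDeclaration:
    # Fields mirror the template table above; the names and types
    # here are illustrative, not normative.
    name: str
    owner: str
    description: str
    size: int                    # e.g. number of records or tokens
    category: str                # Web text, images, music, video, books
    license_type: str            # Commercial License, Proprietary, etc.
    license_name: str            # e.g. GPL, Apache, Creative Commons
    collection_period: tuple     # (start date, end date or "Present")
    usage_period: tuple          # (start date, end date or "Present")
    contains_pii: bool
    pii_opt_in_obtained: bool
    pii_license_mechanism: str   # EULA, Terms of Service, Privacy Policy, Click Through
    pii_anonymized: bool
    contains_copyrighted: bool
    copyright_license: str       # Fair use, Commercial license
    synthetic_data_used: bool

# A hypothetical declaration for a single (fictional) dataset.
decl = DataDeclaration(
    name="ExampleCrawl-2024",
    owner="Example Corp",
    description="Web text crawled from publicly available pages.",
    size=1_000_000,
    category="Web text",
    license_type="Public Domain",
    license_name="Creative Commons",
    collection_period=("2023-01-01", "2024-06-30"),
    usage_period=("2024-07-01", "Present"),
    contains_pii=False,
    pii_opt_in_obtained=False,
    pii_license_mechanism="Privacy Policy",
    pii_anonymized=True,
    contains_copyrighted=False,
    copyright_license="Fair use",
    synthetic_data_used=False,
)

# Serialize to JSON so the declaration can be published or audited.
print(json.dumps(asdict(decl), indent=2))
```

Because every field is explicit and typed, a third-party auditor could check a declaration programmatically, for example flagging any entry where `contains_pii` is true but `pii_anonymized` is false.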
The Need for Legislation
Today a gap exists between the status quo, in which big tech developers conceal their training data, and a future in which AI developers post training data “ingredient lists” and third-party auditors verify the provenance and quality of that data.
Legislation bridges that gap.
One of the first training data declaration requirements will soon go into effect in California. That state’s Training Data Transparency Act (AB 2013), passed and signed in Sept. 2024, requires developers of generative AI systems to publicly post, on their websites, a high-level summary of the datasets used to train that system. Those summaries will be legally required beginning on Jan. 1, 2026.
Having a legal requirement is critical because it weeds out the bad actors. Publishing a high-level data description is easy—but it must be truthful. A developer who publishes a deceptive training data description would be breaking the law. That gives consumers and competitors an avenue for enforcement, keeping the data clean by penalizing cheaters. As it stands today, there are no negative repercussions for developers who use unlicensed, inaccurate, or biased data.
By requiring the digital version of an ingredient label for AI systems, state and federal policymakers will encourage innovation by ethical AI companies, expose the market’s bad actors, and reward both companies and consumers with a safe, fair, and competitive marketplace.