How to Format and Require Data Disclosures

Developers of AI systems should be required to provide documentation for all training data used in the development of an AI model. 

This type of auditable information set provides transparency and assurance to deployers, consumers, and regulators. It’s similar to the SOC 2 reports that are standard in the cybersecurity industry. SOC 2 reports, issued by third-party auditors, evaluate the controls a software or technology service uses to manage security and related risks.

A Data Declaration is not necessarily tied to government oversight. Rather, we believe it should become a standard component of every AI model—expected and demanded by AI system deployers as a transparent mark of quality and legal assurance. 

The Transparency Coalition’s Data Declaration template is available below. Feel free to download it and adapt it to your needs.

| Field Name | Possible Values |
| --- | --- |
| Data Set Name | Text |
| Data Set Owner | Text |
| Data Set Description | Text |
| Data Set Size | Numerical |
| Data Set Category | Web text, images, music, video, books |
| Data Set License Type | Commercial License, Proprietary, Public Domain, Fair Use claim |
| Data Set License Name | e.g. GPL, Apache, Creative Commons |
| Data Set Collection Period | Start date, End date (or Present) |
| Data Set Usage Period | Start date, End date (or Present) |
| Data Set contains personal or personally identifiable information | Yes or No |
| Personal Information Opt-in obtained | Yes or No |
| Personal Information License mechanism | EULA, Terms of Service, Privacy Policy, Click Through |
| Personal Information anonymized prior to training | Yes or No |
| Data Set contains Copyrighted Information | Yes or No |
| License governing Copyrighted Information | Fair use, Commercial license |
| Synthetic Training Data use | Yes or No |
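To make the template concrete, here is a minimal sketch in Python of what a machine-readable Data Declaration record might look like. The field names, the example values, and the validation helper are illustrative assumptions based on the template above, not part of any published schema.

```python
# Illustrative Data Declaration record modeled on the template fields above.
# Field names and the validation helper are assumptions for this sketch,
# not a published schema.

REQUIRED_FIELDS = {
    "data_set_name", "data_set_owner", "data_set_description",
    "data_set_size", "data_set_category", "license_type",
    "collection_period", "usage_period", "contains_personal_info",
    "contains_copyrighted_info", "synthetic_data_used",
}

def validate_declaration(declaration: dict) -> list:
    """Return a sorted list of required fields missing from the declaration."""
    return sorted(REQUIRED_FIELDS - declaration.keys())

# A hypothetical completed declaration for a single data set.
example = {
    "data_set_name": "Example Web Crawl",
    "data_set_owner": "Example Corp",
    "data_set_description": "Public web pages collected for language modeling.",
    "data_set_size": 1_200_000,  # e.g. number of documents
    "data_set_category": "Web text",
    "license_type": "Public Domain",
    "collection_period": ("2023-01-01", "2024-06-30"),
    "usage_period": ("2024-07-01", "Present"),
    "contains_personal_info": False,
    "contains_copyrighted_info": False,
    "synthetic_data_used": False,
}

missing = validate_declaration(example)
print("Missing fields:", missing)  # an empty list means the record is complete
```

A deployer or auditor could run a check like this against every declaration a developer publishes, flagging incomplete records before any deeper provenance review.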

The Need for Legislation

Today there exists a gap between big tech developers concealing their training data and a future in which AI developers post training data “ingredients lists” and third-party auditors verify the provenance and quality of that data. 

Legislation bridges that gap. 

One of the first training data declaration requirements will soon go into effect in California. That state’s Training Data Transparency Act (AB 2013), passed and signed in Sept. 2024, requires developers of generative AI systems to publicly post, on their websites, a high-level summary of the datasets used to train that system. Those summaries will be legally required beginning on Jan. 1, 2026. 

Having a legal requirement is critical because it weeds out the bad actors. Publishing a high-level data description is easy—but it must be truthful. A developer who practices deception with an AI training data description would be breaking the law. That gives consumers and competitors an avenue for enforcement, keeping the data clean by penalizing cheaters. As it stands today, there are no negative repercussions for developers who use unlicensed, inaccurate, or biased data. 

By requiring the digital version of an ingredient label for AI systems, state and federal policymakers will encourage innovation by ethical AI companies, expose the market’s bad actors, and reward both companies and consumers with a safe, fair, and competitive marketplace.
