Big win for AI transparency: California Gov. Newsom signs Training Data Act into law
Sept. 29, 2024 — In a landmark victory for AI transparency, Gov. Gavin Newsom signed the California Training Data Transparency Act into law over the weekend.
The Act, introduced earlier this year as AB 2013 by Assemblymember Jacqui Irwin (D-Thousand Oaks), will require developers of generative AI systems to post on their websites information about the data used to train those systems.
The Act’s signing was hailed by Transparency Coalition.AI (TCAI) co-founders Rob Eleveld and Jai Jaisimha as a watershed moment in AI safety and transparency. AB 2013 was one of TCAI’s top priority measures in Sacramento this year, along with the California AI Transparency Act (SB 942), which was signed into law by Gov. Newsom earlier this month.
"We are excited to have partnered closely on California's AI Training Data Transparency Act with Assemblymember Jacqui Irwin and her team,” said Eleveld. “The Act aligns directly with Transparency Coalition's focus on providing more information about the data used to train AI models. The Training Data Transparency Act will require AI developers to publish a simple data nutrition label similar to what we see every day on cereal boxes.”
“Better control and curation of AI training data inputs leads to higher-quality AI system outputs,” Eleveld added. “That leads to AI models that deployers and the general public can trust. We look forward to expanding the standards and concepts set out in AB-2013 through our advocacy efforts in additional states, and in Congress, during the upcoming 2025 legislative session."
Daylighting the datasets used to train AI models
Currently, there is no legal obligation for generative AI developers to disclose any information about the data used to train the world’s most powerful AI models. Millions of consumers, and many of the world’s leading corporations, now use AI systems like ChatGPT without knowing anything about the basis for the chatbot’s outputs. OpenAI, the developer of ChatGPT, has never disclosed any information about the training data used to build the various generations of its market-leading model. That opacity has led to a proliferation of lawsuits filed by authors, publishers, artists, and media companies who allege that OpenAI and other developers have illegally used their copyright-protected works for financial gain.
Meanwhile, companies face increasing pressure to adopt AI systems within their organizations while knowing nothing about the training data that forms the basis for an AI model’s outputs, or answers.
The passage of AB 2013 represents the first time any legislative body in the United States has adopted a law requiring developers to publicly share the most basic information about an AI model’s training data.
How the new law works
Beginning on Jan. 1, 2026, any time a generative AI system or service is made available to Californians, the developer of that system must post on its website basic information about the data used to train the system.
The law calls for “a high-level summary” of the datasets, including this information:
The source or owner(s) of the datasets
A description of how the datasets further the intended purpose of the AI system
The number of data points included in the datasets
A description of the types of data points within the datasets
Whether the datasets include data protected by copyright, trademark, or patent
Whether the datasets were purchased or licensed by the developer
Whether the datasets include personal information or aggregate consumer information
Whether the developer cleaned, processed, or modified the datasets
The time period during which the data in the datasets were collected
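AB 2013 does not prescribe a machine-readable format for these disclosures; it only lists the information a developer must post. Purely as an illustration of how a developer might organize that “high-level summary” for publication, here is a minimal sketch in Python. Every name in it (the classes, field names, and example values) is hypothetical and drawn from the list above, not from the statute itself.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class DatasetSummary:
    """High-level summary of one training dataset, mirroring the categories
    of information listed in AB 2013 (field names are illustrative)."""
    source_or_owners: List[str]
    purpose_description: str          # how the dataset furthers the system's intended purpose
    number_of_data_points: int
    data_point_types: List[str]       # e.g. web pages, text documents, images
    includes_ip_protected_data: bool  # copyright, trademark, or patent
    purchased_or_licensed: bool
    includes_personal_information: bool
    includes_aggregate_consumer_information: bool
    cleaned_processed_or_modified: bool
    collection_period: str            # time period during which the data were collected


@dataclass
class TrainingDataDisclosure:
    """Disclosure a developer might post on its website for one
    generative AI system or service."""
    system_name: str
    developer: str
    datasets: List[DatasetSummary] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize the disclosure to JSON for posting alongside the
        # human-readable summary on the developer's website.
        return json.dumps(asdict(self), indent=2)


# Example: a disclosure for a fictional model with a fictional dataset.
disclosure = TrainingDataDisclosure(
    system_name="ExampleGPT",
    developer="Example AI Labs",
    datasets=[
        DatasetSummary(
            source_or_owners=["Public web crawl (fictional)"],
            purpose_description="General-purpose text used to train the language model.",
            number_of_data_points=1_000_000_000,
            data_point_types=["web pages", "text documents"],
            includes_ip_protected_data=True,
            purchased_or_licensed=False,
            includes_personal_information=True,
            includes_aggregate_consumer_information=False,
            cleaned_processed_or_modified=True,
            collection_period="2019-2023",
        )
    ],
)

print(disclosure.to_json())
```

A structured record like this is one way a developer could keep its posted summaries consistent across models and releases, though the statute requires only that the information appear on the developer’s website.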
Because it is a state law, the Training Data Transparency Act’s disclosure requirements apply only to AI systems made available to consumers within California. But once that training data ingredient information is published for California, it will effectively be available and known worldwide. So the new law contains local requirements with global ramifications.