Why Training Data Is Not a Trade Secret
As the push for AI training data transparency gains traction in policy circles, some of the most powerful AI companies are pushing back. They argue that basic descriptive information about the data used to train AI models constitutes proprietary information. In other words, they claim it’s a trade secret.
In fact, it’s not.
The documentation that the Transparency Coalition and others seek to daylight is akin to the ingredients label we require from all food manufacturers. We’ve gathered some of the most common inquiries in the FAQ below.
Q: You’re asking AI developers to publish information about the data used to train their AI models. Isn’t that forcing them to divulge trade secrets?
A: No. Training data is not a trade secret. It’s a raw ingredient in the digital recipe.
Q: But doesn’t training data “make” the AI model?
A: The ingenuity in AI doesn’t lie in the datasets. Many of today’s AI systems were trained on the same problematic datasets, such as Common Crawl, WebText2, Books1, and Wikipedia.
Q: If everybody’s using the same training data, where does the innovation lie?
A: It’s in the construction of the model itself. Developing an AI model requires months or years of work envisioning the system, lining up compute power, creating the algorithms, training the model, weighting the data, and building the end-user interface. Datasets are the flour, eggs, and water. The trade secrets lie in the preparation, the process, the heat, the time, and the chef’s skill.
Q: Still, we don’t require Coca-Cola to divulge the recipe for Classic Coke.
A: Actually, we do require the ingredients. Coca-Cola prints them on every can of Coke. Food safety laws strike an appropriate balance by requiring a truthful ingredients list while allowing the company to retain the art of mixing, heating, cooling, and proportioning as trade secrets. Digital transparency laws should strike a similar balance.
Q: If it’s not a trade secret, why would an AI developer balk at disclosing basic information?
A: Many of today’s generative AI systems were trained on raw, unlicensed data that may or may not have been legally obtained. Some of the largest AI developers are now facing lawsuits accusing them of mass copyright infringement.
Q: But haven’t there been recent announcements regarding AI data licensing deals with major outlets such as the Associated Press, News Corp, and the Financial Times?
A: Yes. The fact that those agreements were announced and publicized underscores the point: high-quality training data has great value, but it is not a trade secret.
Q: Why does government need to get involved?
A: Without an appropriate degree of required AI transparency, bad actors may hide their toxic training data behind the “trade secret” curtain, thereby undermining the good-faith efforts, innovation, and investments of ethical AI companies.
As with food ingredient labels, AI training data disclosures do not require companies to reveal their trade secrets. They ask only for a high-level description of the datasets.