FAQ: Is AI training data a trade secret?
Americans depend on ingredient labels for food safety.
They deserve the same from their AI.
As calls for AI training data transparency gain traction in policy circles, some of the most powerful AI companies are pushing back. They argue that even basic descriptive information about the data used to train their models is proprietary. In other words, they claim it’s a trade secret.
In fact, it’s not.
The documentation that the Transparency Coalition and others seek to daylight is akin to the ingredient label we require of every food manufacturer. We’ve gathered answers to the most common questions in the FAQ below.
FAQ: AI Training Data Disclosures
Q: You’re asking AI developers to publish information about the data used to train their AI models. Isn’t that forcing them to divulge trade secrets?
A: No. Training data is not a trade secret. It’s a raw ingredient in the digital recipe.
Q: But doesn’t training data “make” the AI model?
A: The ingenuity in AI doesn’t lie in the datasets. Many of today’s AI systems were trained on the same handful of well-known (and often problematic) datasets, including Common Crawl, WebText2, Books1, and Wikipedia.
Q: If everybody’s using the same training data, where does the innovation lie?
A: It’s in the construction of the model itself. Developing an AI model requires months or years of work envisioning the system, lining up compute power, creating the algorithms, training the model, weighting the data, and building the end-user interface. Datasets are the flour, eggs, and water. The trade secrets lie in the preparation, the process, the heat, the time, and the chef’s skill.
Q: Still, we don’t require Coca-Cola to divulge the recipe for Classic Coke.
A: Actually, we do. Coca-Cola prints the ingredients on every can of Coke. Food safety laws strike an appropriate balance by requiring a truthful ingredients list while allowing the company to retain the art of mixing, heating, cooling, and proportioning as trade secrets. Digital transparency laws should strike a similar balance.
We know less about the ingredients in AI systems than we do about what’s in Coke or Corn Flakes.
Q: But we ingest food; we don’t ingest software. Why should we care what’s in the training data?
A: “Garbage in, garbage out” is the most famous truism in tech. Today’s AI systems operate with almost no transparency about their training data. Some developers use high-quality, legally licensed data. Others use unlicensed data scraped with no regard for its source, legality, accuracy, or bias. That’s how you end up with AI chatbots advising people to pour Elmer’s glue on their pizza. Those same AI systems are involved in life-changing decisions in medical care, insurance, employment, access to education and housing, critical infrastructure, and our judicial process.
Data transparency respects copyright holders, protects data privacy, and rewards companies developing ethical, high-quality AI systems.
Q: If it’s not a trade secret, why would an AI developer balk at disclosing basic information?
A: Many of today’s generative AI systems were trained on raw, unlicensed data that may or may not have been legally obtained. Some of the largest AI developers are now facing lawsuits accusing them of mass copyright infringement. For those developers, resisting disclosure isn’t about protecting trade secrets; it’s about avoiding liability.
Q: But haven’t there been recent announcements of AI data licensing deals with major outlets such as the Associated Press, News Corp, and the Financial Times?
A: Yes. The fact that those agreements were announced and publicized underscores the point that high-quality training data has great value—but it is not a trade secret.
Q: Why does government need to get involved?
A: Without an appropriate degree of required AI transparency, bad actors may hide their toxic training data behind the “trade secret” curtain, thereby undermining the good-faith efforts, innovation, and investments of ethical AI companies.
As with food ingredient labels, AI training data disclosures do not require companies to display their trade secrets. They only ask for a high-level description of the datasets.
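For readers who want a concrete picture, here is one hypothetical sketch, written in Python purely for illustration, of the kind of label-style record such a high-level disclosure might contain. Every field name and sample value below is our assumption, not a format required by any current or proposed law.

```python
from dataclasses import dataclass

@dataclass
class DatasetDisclosure:
    """One entry in a hypothetical 'ingredient label' for an AI model.

    All field names here are illustrative assumptions, not a mandated format.
    """
    name: str                     # e.g., "Common Crawl (filtered subset)"
    source: str                   # where the data came from
    license_status: str           # "licensed", "public domain", "unlicensed", ...
    collection_period: str        # time span the data covers
    contains_personal_info: bool  # whether personal data may be present
    notes: str = ""               # known biases, filtering applied, etc.

# A sample "label" for a fictional model: high-level descriptions only,
# with no training code, model weights, or other trade secrets revealed.
label = [
    DatasetDisclosure(
        name="Common Crawl (filtered subset)",
        source="Public web crawl snapshots",
        license_status="mixed / largely unlicensed web content",
        collection_period="2019-2023",
        contains_personal_info=True,
        notes="Deduplicated and toxicity-filtered before training",
    ),
    DatasetDisclosure(
        name="Licensed news archive",
        source="Commercial licensing agreement with the publisher",
        license_status="licensed",
        collection_period="2000-2024",
        contains_personal_info=False,
    ),
]

for entry in label:
    print(f"{entry.name}: {entry.license_status} ({entry.collection_period})")
```

Note that the analogy holds: the label lists the ingredients and their provenance, while the recipe (the architecture, weights, and training process) stays private.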
Having a legal requirement is critical because it weeds out the bad actors. Publishing a high-level data description is easy, but it must be truthful. A developer who publishes a deceptive training data description would be breaking the law. That gives consumers and competitors an avenue for enforcement, keeping the data clean by penalizing cheaters. As it stands today, there are no repercussions for developers who use unlicensed, inaccurate, or biased data.
By requiring the digital version of an ingredient label for AI systems, state and federal policymakers will encourage innovation by ethical AI companies, expose the market’s bad actors, and reward both companies and consumers with a safe, fair, and competitive marketplace.