Training Data: What the Machine Learns

Training data is the foundation of artificial intelligence. It’s what AI systems like ChatGPT draw on to answer the prompts we give them. It’s what generative image systems like Midjourney and DALL-E use to conjure AI-created art.

Training data is information gathered from across the internet and used to generate answers and images. It can come from Reddit, Wikipedia, personal social media pages, and both free and paid media outlets. AI uses everything, to the tune of billions of pages each month for a given generative AI model.

The specific nature, quality, and source of the massive datasets used to train LLMs (large language models) like ChatGPT are a subject of great controversy. OpenAI, the maker of ChatGPT, has refused to disclose any information about the data used to train what is effectively the world’s most influential AI chatbot.

OpenAI has said its training data is proprietary information, a trade secret. Many people believe the company refuses to disclose the nature of its training data because doing so could expose it to legal trouble: its training datasets may include copyrighted material and other legally protected intellectual property. (Dozens of lawsuits have been filed over these issues.)

Concealing all information about training data has two immediate harmful effects: 

  • It hides the unauthorized and possibly illegal use of legally protected intellectual property, which is a long-winded way of describing theft.

  • It allows no insight into the quality of the data used to train the AI model, and hence no information about the quality of the AI model itself. As the saying goes: “Garbage in, garbage out.” AI models trained on low-quality datasets will generate low-quality outputs.

Without any transparency in information sourcing, the public will become increasingly reliant on a system that has no accountability. In addition to the risk of rampant misinformation, the mere awareness that such a system is fallible further erodes public trust in true information.
