How AI Systems Are Created

At its heart, an AI system is a highly sophisticated computer program. 

That program, known as a model, is created through a process called training, which requires enormous amounts of computing power and massive datasets. By ingesting those datasets, the model “learns” the structure of language, for instance, or patterns derived from millions of images.
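To make that word a little more concrete, here is a deliberately tiny sketch in Python. It is not a real AI model, and the pattern it learns (y = 2x + 1) is invented for illustration, but the spirit is the same: the program repeatedly ingests example data and nudges its internal numbers until its guesses match the examples.

```python
# Toy illustration (not an actual AI model): "training" means adjusting a
# program's internal numbers (parameters) until its outputs match the data.
# Here the hidden pattern is y = 2x + 1; the program discovers it from examples.

data = [(x, 2 * x + 1) for x in range(10)]  # the "training dataset"

w, b = 0.0, 0.0        # the model's parameters, initially zero
learning_rate = 0.01

for step in range(5000):          # repeatedly ingest the data
    for x, y_true in data:
        y_pred = w * x + b        # the model's current guess
        error = y_pred - y_true
        # nudge the parameters to shrink the error (gradient descent)
        w -= learning_rate * error * x
        b -= learning_rate * error

print(f"learned: y ≈ {w:.2f}x + {b:.2f}")   # approaches y = 2.00x + 1.00
```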

It’s worth noting that the term “learn” is imperfect and controversial. Many copyright infringement lawsuits filed by content creators against AI developers hinge on the nature of the training an AI model undergoes. AI developers claim their models learn from copyrighted material much as a student learns from a textbook or a reference image. Copyright holders claim the AI models illegally copy their protected content and retain the ability to reproduce that content exactly. More on that here.

Generative AI models like ChatGPT and Copilot use that training data to predict how likely a given data point, a word or a patch of pixels, is to appear in a particular context. A model might identify correlations like “things that look like bicycles usually have two wheels,” or “eyes are unlikely to appear above eyebrows.” When ChatGPT responds to a text prompt, it has computed that the sequence of words it assembles is a highly probable continuation of that prompt.
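As a rough illustration of that idea (using simple word counts rather than the neural networks real systems rely on), the toy Python sketch below estimates which word is most likely to follow another in a tiny made-up sample of text. The corpus is invented for the example; the principle, predicting what comes next from patterns in past data, is the one described above.

```python
# Toy sketch of next-word prediction, the core mechanic behind text generators.
# Real models use neural networks with billions of parameters; this uses simple
# counts, but the idea is the same: estimate which word is most likely to follow.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word ("bigram" statistics).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def next_word_probabilities(word):
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probabilities("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```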

While this may sound complex, the simple version is this: Generative AI is really just a giant pattern-matching system, but with a memory thousands of times larger than Einstein’s brain. A generative AI system doesn’t invent something new; it only finds the pattern in its memory (from its training data) that is the best statistical match for the chat session or image it is asked to produce.

The big concern is that when there is no curation of the training data for a GenAI system, the pattern match it decides to output may come from Wikipedia, or it may come from an obscure Reddit thread or a Russian troll farm. Even cutting-edge researchers don’t know exactly which of those choices the system makes. But to be clear, if the system is trained on CSAM or conspiracy theories, those will eventually show up in the output somewhere.
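By way of illustration only, one hypothetical curation step might look like the sketch below: keep only training documents whose source appears on an allowlist. The source names and documents are invented for this example, and real curation pipelines involve many more stages (deduplication, safety filters, and so on), but the principle is the same: decide what the model is allowed to learn from.

```python
# Hypothetical sketch of a single curation step: keeping only training
# documents whose source is on an allowlist. Sources and texts are invented.

TRUSTED_SOURCES = {"wikipedia.org", "reuters.com"}   # illustrative allowlist

documents = [
    {"source": "wikipedia.org", "text": "The Eiffel Tower is in Paris."},
    {"source": "obscure-forum.example", "text": "Unvetted claim..."},
    {"source": "reuters.com", "text": "Markets closed higher on Tuesday."},
]

curated = [doc for doc in documents if doc["source"] in TRUSTED_SOURCES]
print(len(curated), "of", len(documents), "documents kept for training")
```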

AI models will inevitably absorb the societal biases reflected in their training data. If that bias is not excised, it will perpetuate and exacerbate inequity in whatever field the model informs, such as healthcare or hiring. This is why, for example, training an AI model on social media data can be problematic.
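One simplified way to surface that kind of bias before training is to compare how often each group in the data receives a positive outcome. The sketch below uses an invented, toy hiring dataset purely for illustration; the group labels and numbers are made up.

```python
# Hypothetical bias check on training data for a hiring model: compare how
# often each group receives a positive label. A large gap means a model
# trained on this data will likely learn and reproduce the disparity.
from collections import defaultdict

records = [  # toy, made-up training examples: (group, hired?)
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", False), ("group_b", False), ("group_b", True), ("group_b", False),
]

totals, positives = defaultdict(int), defaultdict(int)
for group, hired in records:
    totals[group] += 1
    positives[group] += hired

for group in totals:
    rate = positives[group] / totals[group]
    print(f"{group}: hired in {rate:.0%} of training examples")
# group_a: 75%, group_b: 25% -> the data itself encodes a disparity
```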

Transparency about training data is the first step in quality assurance for an AI system. Knowing the source and nature of the system’s training data offers the deployer and the end user (consumer) a degree of assurance about the quality of the system’s output. This is why some AI development companies are now entering into business agreements with standard-bearing content providers like Reuters and the Financial Times, and publicizing those agreements. High-quality training data yields high-quality outputs. 

Previous: Defining Artificial Intelligence

Next: Why the AI Boom Is Happening Now