Understanding Synthetic Data

Image by Dimuth De Zoysa from Pixabay

In today’s AI ecosystem there are two general types of training data: organic and synthetic.

Organic data describes information generated by actual humans, whether that’s a piece of writing, a numerical dataset, a song, an image, or a video. Synthetic data is created by generative AI models using organic data as a base material.

By ingesting and analyzing the organic data, the AI model learns its patterns, correlations, and statistical properties. The model can then generate statistically similar “synthetic data” that looks and feels like the organic data but, ideally, contains none of the organic data’s personal information.
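To make the idea concrete, here is a minimal sketch of the learn-then-generate loop. It is not how commercial generators work: production systems rely on far more capable generative models, and this toy stands in a simple two-column Gaussian fit for them. The column meanings (age, spend), the function names, and the toy "organic" table are all assumptions made for illustration.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Learn the means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mx, my), "std": (sx, sy), "corr": cov / (sx * sy)}

def sample_synthetic(params, n, rng):
    """Draw n new rows from the fitted distribution; no organic row is copied."""
    (mx, my), (sx, sy), rho = params["mean"], params["std"], params["corr"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # preserves the learned correlation
        rows.append((x, y))
    return rows

# Hypothetical "organic" table: (age, monthly spend) for 500 people.
rng = random.Random(0)
organic = []
for _ in range(500):
    age = rng.gauss(40, 10)
    organic.append((age, 2.0 * age + rng.gauss(0, 5)))

params = fit_gaussian(organic)                # the "learning" step
synthetic = sample_synthetic(params, 500, rng)  # the "generation" step
synthetic_params = fit_gaussian(synthetic)
```

Comparing `params` with `synthetic_params` shows the synthetic table tracking the organic one closely without being identical to it. Note that avoiding copied rows is not, by itself, a privacy guarantee.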

A new generation of start-ups has emerged to convert organic data into synthetic data. Companies like Gretel AI, Tonic AI, and ExactData promise to generate “high-quality synthetic data that mimics real production data while preserving privacy.”

Synthetic data falls into two subsets: structured and unstructured. Synthetic images and video are classified as unstructured. Tabular data (like financial transaction records or CRM databases) is structured, because both the data points and the relationships between them are important properties. Structured data that describes human behavior in chronological order is also referred to as behavioral or time-series data.

Synthetic Data: Useful, Important, and Imperfect

Synthetic data can be extremely useful in a number of situations.

It can lead to more accurate AI model predictions and enable the simulation of various potential outcomes. In healthcare, for example, a generative AI system can create synthetic patient data to train predictive models for disease diagnosis and treatment outcomes while protecting patient privacy. In the financial world, synthetic data can simulate market conditions to assess potential risks and optimize investment strategies. Synthetic data can also supplement existing datasets, particularly when data points are limited or classes are imbalanced, which can improve an AI model’s performance.
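Supplementing an imbalanced dataset can be sketched in a few lines. The example below creates synthetic rows for an under-represented class by interpolating between random pairs of real rows. This is in the spirit of oversampling techniques such as SMOTE but is a simplified stand-in, not that algorithm (SMOTE interpolates between nearest neighbors). The function name, class sizes, and toy data are assumptions for illustration.

```python
import random

def augment_minority(rows, target_count, rng):
    """Create synthetic minority rows by interpolating between random pairs of real ones."""
    synthetic = []
    while len(rows) + len(synthetic) < target_count:
        a, b = rng.sample(rows, 2)  # two distinct real rows
        t = rng.random()            # position along the line from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical imbalanced dataset: 200 majority rows, only 20 minority rows.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]

extra = augment_minority(minority, len(majority), rng)
balanced_minority = minority + extra  # now the same size as the majority class
```

Because every synthetic row lies between two real ones, the new rows stay inside the range the minority class already covers. They add volume, not new information, which is one reason synthetic data does not fully substitute for organic data.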

In theory, the evolution of synthetic data should be a win-win: AI developers gain the ability to train new models on known datasets, while the anonymization of the data protects the personal information of individuals. This is especially important in the realm of healthcare, where patient data in the U.S. is closely protected by HIPAA. Synthetic data allows researchers to unlock mysteries and create new treatments while upholding patient confidentiality and privacy.

The main problem with synthetic data, from the AI developer’s standpoint, is that it mimics organic data but does not fully match the quality of the organic data it is derived from.

Further Resources

For further reading, see the 2024 UNESCO report on synthetic data and AI policy.
