It’s not too late to control your own content


So your content has been scraped and used to train an AI model. Is it too late to do anything about it?  

No, it’s not.  

As more and more content creators realize their intellectual property has been scraped and used to train the world’s largest AI models, we at the Transparency Coalition are hearing two common reactions: outrage and resignation.

Artists, writers, performers, and others are outraged that their proprietary works have been used to enrich the world’s largest tech companies, with no permission asked or license granted. Many creators, unfortunately, fall into a position of resignation. They assume that once an AI model has trained on their work, it’s too late to do anything about it.  

That isn’t so. Here’s why.  

AI models are constantly updated

An AI model is a massive computer program. It’s built with a certain digital architecture. Once built, the AI model must be trained—that is, fed an enormous amount of data. From that data—books, videos, music, news, web pages, social media posts—the AI model learns how to use language and gains a storehouse of (not always accurate) factual knowledge.  

AI models, like all software products, are constantly being updated. Just as Microsoft continues to release new and improved versions of its Windows operating system, OpenAI continues to release new and improved versions of its ChatGPT AI chatbot.

Between those upgrade releases, developers are reprocessing and reweighting data within the model all the time. These routine maintenance operations give AI developers ample opportunity to delete data.

When an AI model is officially updated, typically in a highly touted release, its digital architecture changes and the model must be retrained entirely from scratch.

What’s in 3.5 may not be in 4.0

Therefore, if your proprietary content was scraped and used to train ChatGPT 3.5, it was entirely possible for OpenAI developers to remove your content from the training data for ChatGPT 4.0.  

New tools are emerging that allow content creators to designate their work as “Do Not Train” data. By doing so, creators are posting the digital equivalent of a No Trespassing sign, or a copyright declaration.  
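One common mechanism, as a hedged illustration: many creators and publishers signal “do not train” through their website’s robots.txt file, naming the crawlers that AI companies use to gather training data. A minimal sketch follows. The crawler names shown, OpenAI’s GPTBot, Google’s Google-Extended, and Common Crawl’s CCBot, are documented and current as of this writing, but the list changes often and honoring it is voluntary:

    # robots.txt, placed at the root of your website
    # Block OpenAI's training-data crawler
    User-agent: GPTBot
    Disallow: /

    # Block Google's AI-training crawler (does not affect Search indexing)
    User-agent: Google-Extended
    Disallow: /

    # Block Common Crawl, whose dataset is widely used for AI training
    User-agent: CCBot
    Disallow: /

A robots.txt entry is a request, not a lock: reputable crawlers honor it, but it carries no legal force on its own, which is why it works best alongside an explicit copyright notice and clear terms of use.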

Is your material still in the “mind” of ChatGPT 3.5? Yes. But as time passes, fewer and fewer people will use ChatGPT 3.5. Think of it like this: Windows 95 still exists but nobody uses it anymore.  

Your work may have been used to train ChatGPT 4.0 today. If you start designating your past, current, and future work as “Do Not Train” data, it will not be used to train future ChatGPT releases. Two years from now, nobody will be using ChatGPT 4.0. Everybody will have moved on to ChatGPT 7.5, or whatever.  

Is it too late for your work to be removed from today’s AI models? Yes. But there will be new models released tomorrow, and the next day, and the next. Those future versions will depend on training data, and that is where content creators have the power to take back control of their intellectual property.  

More resources for creators and other data owners

At the Transparency Coalition we’re constantly updating and adding to our database of information about AI, training data, and tech policy.
