TCAI urges adoption of ‘Do Not Train’ data and Training Data Request prompts

Image by Reto Scheiwiller from Pixabay

People and organizations should be able to send a signal to indicate that their data may not be used to train artificial intelligence systems.

To that end, the Transparency Coalition is advocating for the adoption of two new and emerging digital frameworks: ‘Do Not Train’ data, and Training Data Request prompts.

If adopted as industry standards or required by legislation, these concepts would allow primary content owners to exercise their intellectual property rights with a minimum of burden.

What is a ‘do not train’ data designation?

Labeling a certain set of data as “Do Not Train” should prevent all AI developers from using the data to train AI models.

The effect of a Do Not Train (DNT) data designation would be similar to a copyright page in book publishing or a robots.txt file in web crawling. It announces: This material is off-limits for use as AI training data.

Most of us are familiar with copyright. A copyright page in a book lets the reader know the work may be read, enjoyed, and passed along reader-to-reader, but it may not be reproduced or republished without the copyright holder’s consent.

Robots.txt is a digital cousin of copyright, of sorts. A robots.txt file placed in a website’s root directory tells search engines like Google and Bing the site’s rules of engagement, declaring specific parts of the website off-limits. Site owners commonly use it to keep search engines from crawling a site’s admin files, for example.
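For example, a minimal robots.txt file that keeps all crawlers out of a site’s admin files looks like this:

    # robots.txt, served from the website's root directory
    User-agent: *        # these rules apply to all crawlers
    Disallow: /admin/    # do not crawl anything under /admin/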

Currently the AI community is coalescing around “ai.txt” as an AI version of the robots.txt directive. An ai.txt file placed in a website’s root directory allows or denies AI developers the use of a domain’s text or media files to train AI models.
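The ai.txt syntax is still being standardized, so the following is a hypothetical sketch assuming a robots.txt-style layout; the actual directives vary by proposal. This example denies AI training use of a site’s images while permitting a public samples directory:

    # ai.txt, served from the website's root directory (hypothetical syntax)
    User-Agent: *          # applies to all AI data collectors
    Disallow: *.jpg        # images may not be used as training data
    Disallow: *.png
    Allow: /samples/       # training on files under /samples/ is permitted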

An early version already exists

An early version of a DNT data designation has been created by Spawning.ai, an independent third-party AI governance start-up. Spawning maintains a Do Not Train registry and provides machine-readable opt-out tools for domain hosts.

Two of the world’s biggest AI companies, Stability AI and Hugging Face, have partnered with Spawning and agreed to honor its DNT registry. Stability AI is the creator of the generative AI image-creation system Stable Diffusion. Hugging Face hosts the world’s largest repository of AI models and datasets.

Starting in January 2025, many state legislatures will consider bills to enhance the safety and transparency of AI systems. We believe those proposals should require generative AI developers to honor the ‘DNT data’ designation, just as search engines today honor the directives of robots.txt as they crawl the digital world.

What are ‘Training Data Request’ prompts?

As part of our mission to create AI safeguards for the greater good, the Transparency Coalition is introducing two new command concepts designed to infuse generative AI systems with a greater level of transparency.

Training Data Requests (TDRs) offer content creators, copyright owners, and individuals a basic level of agency over the use of their data property. We have developed two of these TDRs for consideration in AI-related bills during the upcoming 2025 state legislative sessions; a sketch of what such a request might contain follows the list. They are:

  • Training Data Verification Request (TDVR)

    This is a mechanism by which a primary content owner submits a verified request to a generative AI developer to inquire if their content is included in the AI model’s training dataset.

  • Training Data Deletion Request (TDDR)

    This is a mechanism by which a primary content owner submits a verified request to a developer to delete content that was or will be included in a generative artificial intelligence training dataset.
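Neither request type has a standardized format yet. As a minimal sketch, assuming a simple structured payload (every field name below is hypothetical, not a defined schema), a TDR might carry the following information:

    # Hypothetical sketch of a Training Data Request payload.
    # Field names are illustrative; no standard schema exists yet.
    from dataclasses import dataclass

    @dataclass
    class TrainingDataRequest:
        request_type: str               # "TDVR" (verification) or "TDDR" (deletion)
        owner_name: str                 # the primary content owner making the request
        owner_contact: str              # where the developer sends its response
        identity_proof: str             # evidence of ownership, e.g. a copyright registration
        content_identifiers: list[str]  # URLs, ISBNs, or hashes identifying the content

    request = TrainingDataRequest(
        request_type="TDVR",
        owner_name="Example Author",
        owner_contact="author@example.com",
        identity_proof="US copyright registration TX0001234567",
        content_identifiers=["urn:isbn:978-0-00-000000-0"],
    )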

Who are primary content owners?

Designating a dataset as ‘Do Not Train’ data may be done only by the owner of that specific data. Similarly, TDDRs may be sent only by primary content owners.

“Primary content owner” means a person, partnership, or company that owns, in full or part, digital data, content, or objects that are subject to copyright protection. The definition is also meant to include an individual with personally identifiable information (PII) whose PII has been included in the GenAI model’s training dataset.

Spawning’s search engine allows anyone to find out if their data has been used to train an AI model.

Are training data requests technically feasible?

There are no technical reasons TDRs cannot be built into the current and future generations of AI systems.

Two TDVR search engines already exist for authors and artists, and we’ve linked to them on our Transparency Coalition Learn page. One was created by The Atlantic; it allows authors to check whether their books appear in Books3, a dataset of books scraped from the internet without permission and widely used to train AI models. Another was recently launched by Spawning. The Spawning “Have I Been Trained?” search engine allows artists, illustrators, and photographers to search LAION-5B, one of the most common image datasets used by image-generating GenAI systems.

Once a TDVR request confirms the presence of protected data or content, the primary content owner may choose to submit a Training Data Deletion Request (TDDR).

How a training data deletion request works

AI model architectures are regularly updated to remain competitive. Once trained on a dataset, an AI model cannot “unlearn” the data it has ingested, so immediate deletion isn’t possible. But model updates are structural in nature and require the reprocessing of training data, which makes them ideal opportunities to implement and honor Training Data Deletion Requests (TDDRs). Implementing a TDDR within a reasonable time horizon is well within the realm of feasibility today.
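As a minimal sketch of the mechanics (assuming the developer keeps a registry of identifiers from granted TDDRs and that each training record carries a stable content identifier; both are assumptions, not established industry practice), honoring deletions at the next update could look like this:

    # Hypothetical sketch: honoring TDDRs at retraining time.
    # Assumes each training record carries a stable content identifier and the
    # developer keeps a registry of identifiers from granted deletion requests.

    def filter_training_data(records, deletion_registry):
        """Drop every record whose content_id appears in the TDDR registry."""
        return [r for r in records if r["content_id"] not in deletion_registry]

    deletion_registry = {"urn:isbn:978-0-00-000000-0"}   # IDs from granted TDDRs
    records = [
        {"content_id": "urn:isbn:978-0-00-000000-0", "text": "..."},
        {"content_id": "urn:isbn:978-1-11-111111-1", "text": "..."},
    ]
    clean_records = filter_training_data(records, deletion_registry)
    # The next structural model update then retrains only on clean_records.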

In an ideal world, the industrywide implementation of a Do Not Train data standard would provide a one-stop, universal opt-out signal for content owners. Short of that, a TDDR should allow content owners to manage their intellectual property more directly with a specific AI model developer.

With both a TDVR and a TDDR, we recommend including a requirement for AI developers to respond within 30 days of receiving either request.

 

For more technical information on TDRs, please contact us and we will connect you with the Transparency Coalition’s subject matter experts.
