TCAI Guide to Search Tools: Was Your Data Used to Train an AI Model?

Image by SCY from Pixabay

If you’ve published text, photos, illustrations, or any other organic content on the internet, it’s likely that material has been scraped into a dataset and used to train one or more of the large language models (LLMs) operating today.

While there’s no all-encompassing search tool to determine if your material has been scraped and used as training data, search engines have emerged recently that allow individuals to check specific types of content—books and images—for use as AI training data.

Books

The Atlantic released the first of these search tools in Sept. 2023. Writer Alex Reisner acquired the Books3 dataset, which includes the complete text of more than 191,000 books used without permission to train generative AI systems created by Meta, Bloomberg, and others.

That unauthorized use of those books is now at the heart of a number of copyright lawsuits working their way through federal courts. (More on those court battles here.)

The Atlantic / Books3 database is searchable by author name.

Note: The Atlantic’s search tool sits behind a paywall. You may need to pay to read content creators’ work. Which is kind of what this whole controversy is about.

Images

One of the most innovative start-ups working on AI data governance is Spawning, an independent third-party group based in the United States. One of their co-founders is Holly Herndon, an international AI artist profiled in the New Yorker last year.

Spawning is developing tools to actively block data scrapers from accessing website data, and to keep data searchable while restricting its use for AI training.

As part of that effort, they’ve released a freely accessible search tool, Have I Been Trained, which searches the LAION-5B training dataset, a library of 5.85 billion images used to train Stable Diffusion and Google’s Imagen.

We will update this list as more search tools become available.

How to stop AI from using your data

There are a number of resources to consult on the strategies to keep your organic content from being used to train AI models. They include:

WIRED: How to Stop Your Data From Being Used to Train AI

Spawning: Tools for Rights Holders

Stanford / Human-Centered Artificial Intelligence (HAI): How Do We Protect Our Personal Information?
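For site owners, one low-effort strategy covered in the resources above is adding a robots.txt file that asks known AI crawlers to stay away. A minimal sketch (the user-agent tokens below are the ones publicly documented by OpenAI, Google, and Common Crawl; the list changes over time, and compliance is voluntary on the crawler’s part):

```txt
# robots.txt — placed at the root of your site (https://example.com/robots.txt)

# OpenAI's web crawler used to gather training data
User-agent: GPTBot
Disallow: /

# Google's control token for AI training (separate from Search indexing)
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose archives feed many training datasets
User-agent: CCBot
Disallow: /
```

Note that this only affects future crawls; it does not remove material already collected.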

There are more controversial tools that effectively turn data into “poison” when AI models try to use it. The most well-known are Glaze and Nightshade, created by Ben Zhao and the SAND Lab at the University of Chicago. For more on that:

MIT Technology Review: Nightshade Lets Artists Fight Back

U. of Chicago: What is Nightshade?

U. of Chicago: All About Glaze

Opting out of major platforms

LinkedIn (Microsoft)

LinkedIn is currently using the employment data you post to train AI models. To opt out:

  1. Log into your LinkedIn account and click on your headshot.

  2. Select Resources, then Personal Demographic Information, then Data Privacy.

  3. Under “How LinkedIn Uses Your Data,” select “Data for Generative AI Improvement” and move the slider to Off.

Flipping that switch will prevent the company from feeding your data to its AI, with a caveat: The results aren’t retroactive. LinkedIn says it has already begun training its AI models with user content, and that there’s no way to undo it.

Microsoft Office suite (Microsoft)

Microsoft has begun scraping MS Office documents to train its AI models. Everyone using Office tools (Word, Excel, PowerPoint, et al) is automatically opted in unless they take steps to opt out.

To opt out from a Windows computer:

  1. Open Word or any Office application.

  2. Go to File > Options > Trust Center (left panel) > Trust Center Settings (button) > Privacy Options (left panel) > Privacy Settings (button), then uncheck “Turn on optional connected experiences.”

  3. Click OK to confirm, then close all Office applications (Word, Excel, Outlook, et al) and reopen them for the changed setting to take effect.

ChatGPT (OpenAI)

When you interact with ChatGPT, OpenAI’s generative AI chatbot, the company is using your prompts to train and improve its own AI models. Here’s how to opt out.

Web controls (as a logged-in user):

To disable model training, click your profile icon at the bottom-left of the page, select Settings > Data Controls, and disable “Improve the model for everyone.” While this setting is off, new conversations won’t be used to train OpenAI’s models.

Web controls (as a logged-out user):

To disable model training, click the ? icon at the bottom-right of the page, select Settings > Data Controls, and disable “Improve the model for everyone.”

iOS app:

Tap the three dots on the top right corner of the screen > Settings > Data Controls > toggle off “Improve the model for everyone.”

Android app:

Open the menu through the three horizontal lines in the top left corner of your screen, select Settings > Data Controls, and toggle off “Improve the model for everyone.”

Adobe platforms (Adobe)

Adobe has changed its terms of service to grant the company the right to use your artwork to train its AI models. This means any content you create on its platforms (Illustrator, Photoshop, et al) could be used to train generative AI models that may eventually devalue or replace your work.

Hat-tip to Nashville-based artist Ginny St. Lawrence, who posted this information on her blog.

Take these steps to disable the AI-related features that analyze your work and use it to train Adobe’s current and future AI systems:

  1. Go to the Adobe Account Privacy Settings page.

  2. Locate the “Content analysis” setting and turn it off. This prevents Adobe from using your artwork to train its AI models.

It's important to note that this option is not available for Adobe Stock contributors, as they have a separate agreement with the company. Additionally, Photoshop users may need to turn off an additional feature within the application itself.

Facebook (Meta)

Note: Meta has extended the following opt-out option to European users only, as it is compelled to do so by the EU’s General Data Protection Regulation (GDPR). Meta does not offer an opt-out option for AI scraping to consumers in the United States and Australia, which have no similar legal requirements.

Log in to your Facebook account. Go to “Settings and Privacy,” select Privacy Center, and then find “How Meta uses information for generative AI models and features.” Scroll down and click “Right to Object.”

  1. Complete the form with your personal details. You’ll be asked to explain why you want to opt-out. A simple statement asserting your right to object under GDPR should suffice. Be sure to confirm your email address.

  2. Wait for confirmation. After submitting the form, you’ll receive an email and a notification on Facebook confirming whether your request has been successful.

Instagram (Meta)

Note: This opt-out option is available to EU consumers only. See note under Facebook above.

Log in to Instagram and go to your profile. Tap the three lines in the top-right corner and select “Settings and Privacy.”

  1. Scroll down to “More Info and Support” and select About, then select “Privacy Policy.” At the top of the page, you’ll find the link labeled “Learn more about your right to object.” Select that link.

  2. Submit the form as you would for Facebook, and you’ll receive a notification and email confirmation once your request is processed.
