HarperCollins, Microsoft AI deal sets first public price for training data

Image by Mirko Stödter from Pixabay

A number of AI developers have struck deals with large copyright holders in the past few months in an effort to gain legal access to the high-quality datasets needed to train AI models. The terms of those agreements have remained private, however, which raised an obvious question: training data clearly has value, but how much, exactly, in real dollar terms?

We got the first answer last week when details of a training data deal between Microsoft and the book publisher HarperCollins, a subsidiary of News Corp, became public.

Terms of the deal

The agreement calls for Microsoft to pay a fee of $5,000 per title, split 50-50 between the author and HarperCollins. The deal allows the AI developer to use the data for a period of three years. Payment would be made directly to the author and would not count against a title’s advance. That’s significant, as the majority of published books do not earn out their advances. “Direct payment” means an author would actually receive the AI fee and not merely have it registered against earn-out status on a balance sheet.

The deal also gives authors opt-in rights. HarperCollins writers who consent to the deal will receive the payout; those who decline will not have their work in the training datasets and will not receive the payment. Not all HarperCollins authors will be offered the deal. Microsoft is allowed to pick and choose the titles it wants to include in the training data.

The Authors Guild noted that the deal includes guardrails against users of the AI system generating outputs that could harm the value of the publisher’s books, including limiting outputs to no more than 200 consecutive words and/or 5% of a book’s text. “Other protections,” wrote the Guild, “include a pledge by the AI licensee not to scrape text from piracy websites – an illegal practice that harms authors — and to take action against infringement. These kinds of limitations and conditions on the use of the licensed material are crucial in AI training licenses to prevent the AI from stealing training data and generating harmful outputs.”

That pledge not to scrape text from piracy websites applies only to Microsoft itself as the licensee. That’s an important distinction, because Microsoft is a major investor in OpenAI, which is not a party to the HarperCollins agreement. Earlier this year Microsoft announced that Copilot, the company’s AI assistant in Windows 11 and Bing, would incorporate OpenAI’s GPT-4o technology. OpenAI’s past and current models are widely believed to have been trained on massive scraped internet datasets. OpenAI has never disclosed information about its training sets.
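To make the 200-consecutive-word cap concrete, here is a purely hypothetical Python sketch of one way an output filter could measure the longest verbatim run shared between a model’s output and a licensed book. The deal’s actual enforcement mechanism has not been disclosed, the 5%-of-text cap is not modeled, and the function names are illustrative only.

```python
# Illustrative sketch only: the agreement reportedly caps outputs at no more
# than 200 consecutive words from a licensed book. This shows one way such a
# cap could be checked; the real enforcement mechanism is not public, and the
# separate 5%-of-text limit is not modeled here.

def longest_shared_word_run(output_text: str, book_text: str) -> int:
    """Length of the longest run of consecutive words shared by both texts."""
    out_words = output_text.lower().split()
    book_words = book_text.lower().split()
    best = 0
    # Dynamic programming: longest common substring over word tokens.
    prev = [0] * (len(book_words) + 1)
    for ow in out_words:
        curr = [0] * (len(book_words) + 1)
        for j, bw in enumerate(book_words, start=1):
            if ow == bw:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best:
                    best = curr[j]
        prev = curr
    return best

def output_within_cap(output_text: str, book_text: str, max_run: int = 200) -> bool:
    """True if the output quotes no more than max_run consecutive words of the book."""
    return longest_shared_word_run(output_text, book_text) <= max_run
```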

Value set: $1,667 per title, per year

Interestingly, the HarperCollins agreement prices the data in broad strokes. The value of a book could conceivably be broken down into per-page or per-word rates. The average adult trade title runs about 275 pages and 75,000 words; children’s books may contain mere dozens of words, while major biographies can run to 500 pages.

Using the 275-page / 75,000-word average, the HarperCollins deal sets the price at:

  • $1,667 per book title, per year

  • $6 per page, per year

  • $0.022 per word, per year

And, of course, the publisher and the author split these payments 50-50.
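For readers who want to check the arithmetic, here is a minimal Python sketch of the same breakdown, assuming only the figures cited above (the $5,000 fee, the three-year term, and the 275-page / 75,000-word averages):

```python
# Back-of-the-envelope breakdown of the reported HarperCollins/Microsoft terms.
# Assumptions: $5,000 per title over a three-year term, and an average adult
# trade title of 275 pages / 75,000 words (figures cited above).
FEE_PER_TITLE = 5_000   # USD, total over the term
TERM_YEARS = 3
AVG_PAGES = 275
AVG_WORDS = 75_000

per_title_year = FEE_PER_TITLE / TERM_YEARS
per_page_year = per_title_year / AVG_PAGES
per_word_year = per_title_year / AVG_WORDS

print(f"Per title, per year: ${per_title_year:,.0f}")   # ≈ $1,667
print(f"Per page, per year:  ${per_page_year:.2f}")     # ≈ $6.06
print(f"Per word, per year:  ${per_word_year:.4f}")     # ≈ $0.0222

# Author and publisher each receive half of these amounts.
author_share_per_year = per_title_year / 2               # ≈ $833
```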

Initial reactions

The Authors Guild, which filed a lawsuit against OpenAI last year for copyright infringement, noted that the agreement represented a step in the right direction, as “the licensed use of books must replace AI companies’ current unlicensed, uncontrolled, and infringing use” of scraped internet databases. The Guild added: “We are cognizant of the resources and time that HarperCollins is investing in building an ethical licensing framework.”

The Guild was less impressed with the terms of the deal for authors, writing:

We believe that a 50-50 split for a mere AI training license gives far too much to the publisher. These rights belong to the author as they are not book or excerpt rights; it is the authors’ expression that produces value in AI licensing. Even when the publisher is serving as the licensor on behalf of its authors, the authors should receive most of the revenue, minus only the equivalent of an agent’s fee, plus what is needed to compensate the publisher for additional labor or rights, such as creating the files that are licensed and providing metadata—and that is to be negotiated between the publisher and the author or their agent.

Publishers move to strengthen protections against AI use

The new deal comes as publishers move to protect their investments against non-consensual use by AI developers.

Last month Penguin Random House (PRH), the world’s largest trade publisher, changed the wording on its copyright pages to protect authors’ intellectual property from being used to train large language models (LLMs) and other artificial intelligence (AI) tools. The Bookseller reported that PRH amended the wording across all imprints globally. The new language states: “No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems.”

Previous training data deals with media outlets

News Corp, the parent company of HarperCollins, struck a deal with OpenAI earlier this year, allowing the AI company to train its models on News Corp’s digital outlets, including The Wall Street Journal, the New York Post, The Daily Telegraph, and other publications.

In April, the Financial Times reached an agreement with OpenAI to train artificial intelligence models on the publisher’s archived content. The deal allows OpenAI to use FT material to help develop generative AI technology, and it allows ChatGPT to respond to questions with short summaries drawn from FT articles, with links back to FT.com.

Learn more about training data and books

To learn more about training data, copyright law, and datasets like Books3, see:

The Transparency Coalition advocates for the disclosure of training data to protect the property of creators and copyright holders, to enhance the quality of AI systems, and to hold AI developers accountable for the ethical sourcing of training data. Learn more about our ideas here.
