Harvard and Google to release 1 million public domain books as AI training data

Dec 12

AI training data is expensive, which puts it mostly within reach of deep-pocketed tech firms. That is why Harvard University plans to release a dataset of roughly 1 million public-domain books spanning genres and languages, with authors including Dickens, Dante, and Shakespeare. The works are old enough that they are no longer protected by copyright.

The new dataset isn’t available yet, and it’s not clear when or how it will be released. Its books are derived from Google’s longstanding book-scanning project, Google Books, so Google will be involved in releasing “this treasure trove far and wide.”

Read the full TechCrunch story here.

Bruce Barcott
