“OpenAI and Microsoft Called Upon by The New York Times to Compensate for Training Data”

The New York Times has filed a lawsuit against OpenAI and its close collaborator, Microsoft, for allegedly violating copyright law. The Times is accusing OpenAI and Microsoft of training their generative AI models using Times’ content without proper consent.

In the lawsuit, filed in the Federal District Court in Manhattan, The Times asks that OpenAI and Microsoft be ordered to “destroy” any models and training data that contain Times’ content and be held responsible for damages amounting to “billions of dollars.” The Times contends that its unique and valuable works have been unlawfully copied and used by OpenAI and Microsoft, a practice it says could have a significant impact on society.

“If The Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill,” reads The Times’ complaint. “Less journalism will be produced, and the cost to society will be enormous.”

In a statement sent via email, a spokesperson for OpenAI said that the company respects the rights of content creators and owners and is committed to working with them to ensure they benefit from AI technology and new revenue models. The spokesperson added that OpenAI’s ongoing conversations with The New York Times have been productive and that the company hopes to find a mutually beneficial way to work together, as it is doing with other publishers.

Generative AI models learn from examples to generate various types of content, including essays, code, emails, and articles. Vendors of AI models, such as OpenAI, often scrape the web for millions to billions of examples to add to their training sets. While some of these examples are in the public domain, others are not and require proper citation or compensation.

Vendors argue that their web-scraping practices are protected by the fair use doctrine. Copyright holders disagree, and many news organizations have taken measures to prevent OpenAI, Google, and other companies from scanning their websites for training data.

The conflict between vendors and news outlets has resulted in a series of legal battles, with The Times being the latest to join. Actress Sarah Silverman sued Meta and OpenAI in July, alleging that they used her memoir to train their AI models without her consent. In another case, hundreds of novelists, including Jonathan Franzen and John Grisham, have accused OpenAI of using their work as training data without their permission or knowledge. There is also an ongoing case against Microsoft, OpenAI, and GitHub over Copilot, an AI-powered code-generating tool, which the plaintiffs allege was developed using their protected code.

While The Times is not the first to sue generative AI vendors for using written works without proper consent, it is the largest publisher involved in such a suit to date. Its complaint also highlights the potential damage to its brand from “hallucinations,” or false information generated by the AI models.

The complaint references several instances where Microsoft’s Bing Chat, now called Copilot, provided incorrect information that was attributed to The Times. In one example, a search for “the 15 most heart-healthy foods” returned results attributed to The Times, 12 of which were not mentioned in any Times article.

The complaint states, “Defendants seek to free-ride on The Times’s massive investment in its journalism.” The Times argues that OpenAI and Microsoft are essentially building competitors to news publishers using The Times’ works, which in turn harms its business by providing information that would typically require a subscription to access.

As mentioned in the complaint, generative AI models have a tendency to reproduce training data, often producing information that is almost identical to the input. In one instance, OpenAI inadvertently enabled ChatGPT users to bypass paywalled news content.

“The defendants are essentially using The Times’ content without payment to create products that compete with and take audiences away from The Times,” the complaint states.

The impact of these AI models on news outlets’ subscription businesses and web traffic is also at the center of another lawsuit, filed against Google by publishers earlier this month. In that case, the plaintiffs, like The Times, accuse the defendant of siphoning off their content, readers, and ad revenue; they allege that Google does so through anticompetitive means using its GenAI experiments, including the AI-powered Bard chatbot and Search Generative Experience.

There is evidence to support the publishers’ claims. One model from The Atlantic found that if a search engine like Google were to integrate AI into search, it could answer a user’s query without requiring a click-through to the publisher’s website 75% of the time. Publishers involved in the lawsuit estimate that they could lose up to 40% of their traffic.

However, the success of these lawsuits in court is not guaranteed. Heather Meeker, a founding partner at OSS Capital and an adviser on IP matters, including licensing arrangements, likens The Times’ example of regurgitation to “using a word processor to cut and paste.”

“If the user intentionally makes the chatbot copy, that’s the user’s fault. And that’s why most lawsuits like this will probably fail,” Meeker explains.

Instead of court battles, some news outlets have chosen to negotiate licensing agreements with generative AI companies. In July, the Associated Press reached a deal with OpenAI, and this month, Axel Springer, the German publisher that owns Politico and Business Insider, did the same.

In its complaint, The Times says that it attempted to negotiate a licensing arrangement with Microsoft and OpenAI in April, but that the talks were unsuccessful.

This article was updated at 4:24 Eastern with additional context and comment from OpenAI.