In late December, The New York Times sued OpenAI and its close collaborator and investor, Microsoft, for allegedly violating copyright law by training generative AI models on the Times' content.
In its response, posted this afternoon to OpenAI's blog, the company restates its belief that using publicly available data from the web to train AI models, including news articles like those from The New York Times, falls under fair use. Essentially, OpenAI argues that it does not need to obtain a license or pay for the examples used to create generative AI systems such as GPT-4 and DALL-E 3. These systems "learn" from millions of examples of various forms of media, including artwork, ebooks, essays, and other types of text and images, in order to generate human-like output.
OpenAI's response also addresses regurgitation, a phenomenon in which generative AI models, when prompted, output training data nearly verbatim; a model might, for example, generate an image that is almost identical to a photograph it was trained on. OpenAI argues that regurgitation is less likely to occur with training data that comes from a single source, such as The New York Times, and places the onus on users to act ethically and avoid intentionally prompting its models to regurgitate, which violates its terms of use.
“Interestingly, the regurgitations The New York Times cite in its lawsuit appear to be from years-old articles that have proliferated on multiple third-party websites,” OpenAI writes. “It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.”
OpenAI’s response comes as the debate over copyright in relation to generative AI reaches a fever pitch.
In a piece published this week in IEEE Spectrum, noted AI critic Gary Marcus and Reid Southen, a visual effects artist, demonstrated how AI systems, including DALL-E 3, can regurgitate training data even when they aren't specifically prompted to do so, calling into question OpenAI's claim to the contrary. Marcus and Southen also reference The New York Times' lawsuit in their piece, noting that the Times was able to elicit "plagiaristic" responses from OpenAI's models simply by providing the first few words of a Times story.
The New York Times is just the latest copyright holder to sue OpenAI over what it believes is a clear violation of intellectual property rights. In July, actress Sarah Silverman joined a pair of lawsuits that accuse Meta and OpenAI of “ingesting” her memoir to train their AI models. In another case, thousands of novelists, including Jonathan Franzen and John Grisham, claim that OpenAI used their work as training data without their permission or knowledge. Additionally, several programmers have an ongoing case against Microsoft, OpenAI, and GitHub over Copilot, an AI-powered code-generating tool, which they argue was developed using their IP-protected code.