OpenAI on Monday said that The New York Times (NYT) is not telling the full story about the lawsuit it filed against the Sam Altman-led company and Microsoft on December 27.
“Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate,” OpenAI wrote in a blog post.
As part of the lawsuit, the NYT submitted approximately 100 examples of alleged copyright violations in which ChatGPT or its underlying model returned text nearly identical to paragraphs published in NYT articles or editorial content.
However, OpenAI has claimed that even when “manipulated” prompts are used, its models “don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.”
OpenAI said the behavior shown in the NYT’s examples is neither typical nor permitted user activity. It added that the generated text is not a substitute for the newspaper.
Ian Crosby, partner with Susman Godfrey and lead counsel for The New York Times, said the OpenAI blog concedes that OpenAI used The Times’s work, along with the work of many others, to build ChatGPT. “As The Times’s complaint states, ‘Through Microsoft’s Bing Chat (recently rebranded as ‘Copilot’) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.’ That’s not fair use by any measure.”
OpenAI working on solving the regurgitation issue
The Sam Altman-led company said it has identified the “regurgitation” issue in ChatGPT and is working to solve it. It calls the underlying behavior “memorization” and describes it as a failure of the model training process.
Memorization, according to the company, tends to happen more often when particular content appears more than once in the training data, as is the case when NYT articles are also published on other websites.
“So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use,” the company wrote in the blog post.
Experts argue over copyright claims
While there has been a lot of commentary about the NYT lawsuit against OpenAI, several technology innovators appear to sympathize with OpenAI’s reasoning.
“After reading the @nytimes lawsuit against @OpenAI and @Microsoft, I find my sympathies more with OpenAI and Microsoft than with the NYT,” Andrew Ng, one of the leading scientists in the field of AI, wrote on X, formerly Twitter.
Ng claimed that just as humans are allowed to read documents on the open internet, learn from them, and synthesize brand-new ideas, AI should be allowed to do so too.
“I would like to see training on the public internet covered under fair use — society will be better off this way — though whether it actually is will ultimately be up to legislators and the courts,” the AI scientist explained in DeepLearning.AI’s weekly newsletter.
Somewhat supporting OpenAI’s claims, Ng further said that the examples of violations put forth by NYT occurred due to a RAG-like mechanism where the user prompt causes the system to browse the web, retrieve a specific article, and then print it out.
Systems architect Daniel Jeffries also took to Twitter to argue that the Times case has a near-zero probability of succeeding, a view that somewhat supports OpenAI’s claims.
Jeffries was reacting to a Twitter post by Jason Kint, who argued that the Times case was likely to succeed. Kint is the CEO of Digital Content Next, a trade association for content companies.
The systems architect also suggested that the Times case may go the same way as the Sarah Silverman case, in which a US district judge ruled that determining whether generated images may be in direct violation of copyright laws was “not plausible” at the moment.
OpenAI has already stated that it believes training AI models using publicly available internet materials is fair use. This practice, according to the company, is supported by long-standing and widely accepted precedents.
Before the NYT filed the lawsuit, OpenAI said it was holding negotiations on a deal with the NYT through December 19.
“The negotiations focused on a high-value partnership around real-time display with attribution in ChatGPT, in which The New York Times would gain a new way to connect with their existing and new readers, and our users would gain access to their reporting,” it wrote in the blog post. The company added that it had already explained to the Times that their content didn’t meaningfully contribute to the training of its existing models and also wouldn’t be sufficiently impactful for future training.
(The story has been updated with a comment from The New York Times)