Summary
This highly publicised case has been brought by The New York Times against Microsoft and OpenAI in the US District Court for the Southern District of New York, relating to ChatGPT (including associated offerings), Bing Chat and Microsoft 365 Copilot. It follows a period of months during which the NYT said it attempted to reach a negotiated agreement with Microsoft/OpenAI.
The Complaint raises arguments of large-scale commercial exploitation of NYT content through the training of the relevant models (including GPT-4 and the next-generation GPT-5), noting that the GPT LLMs have also 'memorized' copies of many of the works encoded into their parameters. There are extensive exhibits (69 exhibits, comprising around 2000 pages) attached to the Complaint. Exhibit J in particular contains 100 examples of output from GPT-4 (as a 'small fraction') based on prompts in the form of a short snippet from the beginning of an NYT article. The example outputs are said to recite NYT content verbatim (or near-verbatim), closely summarise it, and mimic its expressive style (and also wrongly attribute false information - hallucinations - to NYT).
The Complaint also focuses on synthetic search applications built on the GPT LLMs which display extensive excerpts or paraphrases of the contents of search results, including NYT content, that may not have been included in the model's training set (noting that such output contains more expressive content from the original article than would be the case in a traditional search result, and omits the hyperlink to the NYT website).
The claims are for direct copyright infringement, vicarious copyright infringement, contributory copyright infringement, DMCA violations, unfair competition by misappropriation, and trade mark dilution.
On 26 February 2024, OpenAI filed a Motion to Dismiss in relation to parts of the claim for direct copyright infringement (re conduct occurring more than 3 years ago), as well as the claims relating to contributory infringement, DMCA violations and state common law misappropriation. In particular, OpenAI alleges that the 'Times paid someone to hack OpenAI's products' and that it took 'tens of thousands of attempts to generate the highly anomalous results' in Exhibit J to the Complaint, including by targeting and exploiting a bug (which OpenAI says it has committed to addressing) in violation of its terms of use. OpenAI goes on to characterise the key dispute in the case as being whether it is fair use to use publicly accessible content to train generative AI models to learn about language, grammar and syntax, and to 'understand the facts that constitute humans' collective knowledge'. The New York Times has characterised OpenAI's motion as grandstanding, with an attention-grabbing claim about 'hacking' that is both irrelevant and false.
Microsoft filed its Motion to Dismiss parts of the claim on 4 March 2024, focusing on (1) the allegation that Microsoft is contributorily liable for end-user infringement, (2) the alleged violation of the DMCA's copyright management information provisions, and (3) state law misappropriation torts. Drawing an analogy with earlier disruptive technologies, the Motion states "copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine)". Its point is that the US Supreme Court has previously rejected liability based merely on offering a multi-use product that could be used to infringe. It further states that Microsoft "looks forward to litigating the issues in this case that are genuinely presented, and to vindicating the important values of progress, learning and the sharing of knowledge".
The Plaintiffs filed an Amended Complaint on 12 August 2024 (the amendments add a further approximately 7 million works to the suit).
The case has been consolidated with The Daily News complaint and also with the claim brought by The Center for Investigative Reporting.
OpenAI and Microsoft have filed Motions to Compel The New York Times to produce documents relating to the valuation of, and market(s) for, the works in suit, including in relation to subscription losses, effects on advertising revenue, web traffic documentation and high-level economic and financial reporting data. These documents are said to be critical to assessing the fourth fair use factor, the effect of the use upon the potential market for or value of the copyrighted work.
Impact
The opening words of the Complaint stress the importance of independent journalism for democracy - and the threat posed to the NYT's ability to provide that service by the use of its works to create AI products. The Complaint further highlights the role of copyright in protecting the output of news organisations, and their ability to produce high-quality journalism.
The NYT website is noted in the Complaint as being the most highly represented proprietary source of data in the Common Crawl dataset, itself the most highly weighted dataset in GPT-3. Given the previous attempt at negotiations referred to in the Complaint, it will be interesting to see whether the filing of this Complaint will lead to more fruitful licence negotiations, or whether the case will continue to trial (in which case, it should be tracked alongside the other complaints against OpenAI and Microsoft).
OpenAI's position is that 'training data regurgitation' (or memorisation) and hallucination are 'uncommon and unintended phenomena'. Memorisation is a problem that OpenAI says it is working hard to address, including through sufficiently diverse datasets. Meanwhile, it points to its partnerships with other media outlets.