US Copyright Office Report Part 3: Generative AI training

As generative AI continues to advance rapidly, it is raising fundamental questions at the intersection of technology and creative rights. Central to the debate is how the law should apply to the use of massive datasets, often containing copyright-protected works, to train powerful generative AI systems. The US Copyright Office (USCO) has considered this complex issue in depth as part of its ongoing comprehensive study on Copyright and Artificial Intelligence. This effort has produced a series of reports, the most recent of which, Part 3: Generative AI Training, specifically addresses the use of copyrighted materials in developing these sophisticated AI models.

However, the release of this 108-page analysis came at a turbulent moment for the USCO. Issued as a "pre-publication version", the report appeared in the midst of unexpected leadership changes. Just the day before its public release, the Librarian of Congress, Dr. Carla Hayden, was abruptly dismissed. Meanwhile, within a day of the report going live, the Register of Copyrights, Shira Perlmutter, was also removed from her position. This turmoil demonstrates the political scrutiny surrounding the issue.

Consequently, questions linger regarding the ultimate official status of this report. It remains uncertain whether it will stand as the USCO's final position or whether future leadership might revisit and revise its contents. Despite this political fallout and uncertainty, the report's detailed analysis and conclusions are already shaping discussions and may influence various pending cases before the US courts.

Infringement

The report clarifies that creating and deploying a generative AI system using copyright-protected material involves multiple actions which, without a licence or applicable defence, such as fair use, may constitute prima facie infringement under the US Copyright Act. These include:

1. Data collection and curation

Compiling a training dataset using copyrighted works "clearly implicates the right of reproduction", according to the report. This involves downloading, storing, filtering, cleaning, and compiling data, potentially creating multiple copies of documents or works. The report notes that developers may abridge, rewrite, or augment works during curation, which could also implicate the right to prepare derivative works. Removing text or metadata related to the author or owner during cleaning might also raise issues concerning copyright management information.

2. Training

The training process itself also implicates the right of reproduction. The report considers whether a model's resulting "weights" (or parameters) are themselves infringing copies if the model implicitly stores knowledge or substantially similar material from the training data. While this is debated, the report concludes that, where models output copies that are substantially similar to training data inputs, the model's weights may infringe the right of reproduction.

3. Retrieval-Augmented Generation (RAG)

This technique involves the AI system retrieving relevant material from a database (often populated by copies of copyrighted works) to enhance its response to a user query. The report concludes that this process also involves the reproduction of copyrighted works.
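The retrieval step described above can be illustrated with a minimal Python sketch. It is a simplified illustration only, not a description of any particular system or of the report's own analysis: the in-memory DOCUMENTS store, the word-overlap scoring and the function names are all hypothetical, and production systems typically use vector search rather than word counts. The point it illustrates is that the stored text is copied verbatim into the prompt passed to the model.

```python
from collections import Counter

# Hypothetical in-memory "database" standing in for a store of copied works.
DOCUMENTS = {
    "doc1": "The quick brown fox jumps over the lazy dog.",
    "doc2": "Fair use is a defence to copyright infringement in the US.",
    "doc3": "Retrieval systems copy stored text into the prompt.",
}


def tokenize(text):
    """Lower-case the text and strip basic punctuation from each word."""
    return [word.strip(".,?!").lower() for word in text.split()]


def score(query, document):
    """Count the words shared between the query and the document."""
    query_counts = Counter(tokenize(query))
    doc_counts = Counter(tokenize(document))
    return sum(min(query_counts[w], doc_counts[w]) for w in query_counts)


def retrieve(query, documents, k=1):
    """Return the ids of the k documents that best match the query."""
    ranked = sorted(
        documents,
        key=lambda doc_id: score(query, documents[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Copy the retrieved text verbatim into the prompt sent to the model."""
    context = "\n".join(documents[doc_id] for doc_id in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt("What is fair use?", DOCUMENTS)
print(prompt)
```

In this sketch the retrieved document appears word for word in the prompt, which is the kind of copying the report treats as reproduction.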

Fair use defence

The report dedicates significant attention to the fair use doctrine, the primary defence that AI companies are relying upon in the various cases before the US courts. The report emphasises that the fair use analysis is fact-dependent and requires an assessment in the context of the overall use, not just intermediate steps considered in isolation.

Factor 1: purpose and character of the use

The report highlights that different uses during AI development and deployment require separate consideration, although they should be evaluated with reference to the broader purpose of copying.

A central issue is whether the use is 'transformative'. A use is transformative if it adds something new, with a further purpose or different character, altering the original work with new expression, meaning, or message. Such use is less likely to substitute for the original in the market.

The report directly confronts, and largely rejects, commonly made arguments that AI training is inherently transformative. It finds the argument that training is "non-expressive use" mistaken where models generate expressive content through words and images. It also rejects the analogy to human learning, noting that copyright law grants exclusive rights precisely because humans retain only imperfect impressions filtered through unique perspectives, unlike AI's potential for creating perfect copies.

Transformativeness is a matter of degree. The report suggests that training a model on a large and diverse dataset will often be transformative, as it involves converting a massive collection of data into a statistical model, whereas training models designed to produce content that competes with the original works is "at best, modestly transformative" and less likely to be found fair. Similarly, RAG is less likely to be transformative if it serves the same purpose as the original work.

A further consideration is the commerciality of the use, which the report says should be assessed by looking at whether the specific use in question in fact serves a commercial or non-profit purpose. Use for commercial purposes is likely to weigh against fair use, as would knowing use of a dataset that consists of pirated or illegally accessed works (an issue that has arisen in a number of the cases).

Factor 2: nature of the copyrighted work

This factor considers how close the work is to the "core" of copyright protection. Use of more creative or expressive works (like novels, art, music), or unpublished works, is generally less likely to be fair use than use of factual or functional works (like computer code). While important, this factor often plays a less substantial role in the overall analysis.

Factor 3: amount and substantiality of the portion used

This factor looks at the quantity and quality of the material used in relation to the original work and whether the amount is reasonable given the purpose of the copying. Using entire works ordinarily weighs against fair use, although this may depend on the transformative purpose.

The report acknowledges that internet-scale mass copying of entire works may be technically necessary for certain types of AI training, particularly for achieving the performance of current-generation models. To the extent that there is a transformative purpose, use of entire works on that scale could potentially weigh in favour of fair use.

The report also notes that the amount made available to the public through outputs is relevant. For instance, where a training dataset contains entire works, but the model includes "guardrails" which succeed in preventing or minimising the output of protected material to users, the weight attributed to this third factor will be lessened. Examples of guardrails include blocking certain prompts or using internal instructions to avoid generating infringing content.
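As a rough illustration of what an output-side guardrail might look like, the Python sketch below withholds a response if it shares a run of five consecutive words with a protected text. This is an assumption-laden example rather than a technique described in the report: the function names, the five-word threshold and the refusal message are all hypothetical, and real systems are likely to combine several more sophisticated measures.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in the text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlaps_protected(output, protected_texts, n=5):
    """Check whether the output shares any n-word run with a protected text."""
    output_ngrams = ngrams(output, n)
    return any(output_ngrams & ngrams(text, n) for text in protected_texts)


def guarded_response(output, protected_texts, n=5):
    """Withhold the output if it appears to reproduce protected text."""
    if overlaps_protected(output, protected_texts, n):
        return "[response withheld: possible reproduction of protected text]"
    return output


# Hypothetical protected text and candidate model outputs.
protected = ["it was the best of times it was the worst of times"]
print(guarded_response("He wrote it was the best of times it was", protected))
print(guarded_response("A short original sentence about copyright law today", protected))
```

A filter of this kind only addresses what reaches users; it does not change what was copied during training, which is why the report treats guardrails as reducing the weight of the third factor rather than removing it.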

Factor 4: effect on the market

Described by the US Supreme Court as the "single most important element of fair use", this factor examines the effect of the use upon the potential market for, and value of, the copyrighted work.

The report takes a broad view of market harm, acknowledging this is "uncharted territory". It considers not only harm to the market for the original work itself (lost sales from substitution) but also market dilution, where outputs compete in the market for that type of work, even if they are not substantially similar to a specific original.

The report finds that the speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as those in their training data. This can lead to increased competition for sales and make it harder for audiences to find original works.

Lost licensing opportunities for rights holders are also considered. The report notes that markets for licensing data for AI training are developing, and that where licences exist or are likely to be feasible, unlicensed use will weigh against fair use.

The report concludes that, if a use enables the creation of works that compete directly with original works in existing markets, especially when done on a commercial scale, the fourth factor is likely to weigh strongly against fair use.

Overall, the report finds that there will not be a single answer regarding whether unauthorised use of copyrighted materials for AI training is fair use. It outlines a spectrum: non-commercial research uses with no public output of original works are more likely fair, while commercial uses enabling competitive outputs are less likely to fall within the defence.

Licensing and policy considerations

The report also touches upon the feasibility of licensing solutions. It notes that, while voluntary licensing agreements are emerging in some sectors, challenges remain in terms of scalability, transaction costs, and identifying rights holders, particularly for "vernacular works" posted online without expectation of monetisation.

The report declines to endorse a compulsory licensing regime, arguing that the potential harm outweighs the benefits and that such regimes should only be considered in clear instances of market failure and enacted narrowly by Congress. It also discusses Extended Collective Licensing schemes, which could provide greater flexibility, and the possibility of statutory amendments to introduce opt-out schemes.

The report recommends allowing the licensing market to continue to develop for now without government intervention, noting the "relatively nascent" state of the law, technology, and markets.

Implications for stakeholders and future direction

The report's analysis casts some doubt on arguments that current AI training practices are broadly protected under the fair use doctrine in the US. For developers and technology companies, it signals a clear need for caution when incorporating copyrighted material into training datasets. The report encourages proactive steps, such as licensing content, and advises that businesses should closely track emerging case law and be ready to adapt their models as the legal landscape evolves.

Content creators and rights holders are also directly addressed. The report strengthens their argument that the use of their works for commercial AI training should not occur without authorisation and appropriate compensation.

Meanwhile, policymakers and industry groups are advised to expect increasing congressional attention. The report underlines the importance of ongoing dialogue around licensing standards, metadata practices, and transparency mechanisms, all of which are likely to shape future regulation and industry norms.

While the report's formal status remains unclear given the institutional uncertainty at the USCO, its thorough analysis is likely to impact the ongoing debate and litigation surrounding AI and copyright. For UK stakeholders, including developers, rights holders, and legal advisors, it offers valuable insight into how US law may evolve and the principles that may inform international approaches.
