In brief
- Following publication of the General-Purpose AI (GPAI) Code of Practice, the Commission has published Guidance that provides clarity on some unanswered questions, such as how GPAI models will be defined and classified, and how the Code will apply to 'downstream' model fine-tuners and integrators.
- While the Code and the Guidance are not mandatory, GPAI model developers are required under the EU AI Act to publish, from 2 August 2025, a summary of the training data used to create their GPAI models. GPAI models already on the EU market as at 2 August 2025 have until 2 August 2027 to comply.
- The Commission has published a Template for GPAI model training data summaries that aims to provide "a common minimal baseline". Rights holders will no doubt in due course interrogate these summaries for the purposes of considering action against GPAI model developers for infringement of their IP rights.
It's been a busy couple of weeks for AI regulation in Europe. On 10 July 2025, the Commission published the final version of its long-awaited GPAI Code of Practice. You can read our article on what the voluntary Code means for your business. Subsequently, on 18 July 2025, the Commission published Guidance to supplement the Code, clarifying its scope on some key points. And then, on 24 July 2025, it published its AI training data disclosure template that relates to the mandatory obligation on in-scope model developers to publish information about the training of their models. In this article, we'll explore the implications of the Guidance and Template in more detail.
The Guidance
The Code is non-binding and is intended to assist model developers and providers to comply with their obligations under the EU AI Act. It addresses three core legal and regulatory areas of GPAI development and implementation: i) transparency; ii) copyright; and iii) safety and security. However, several fundamental questions concerning the scope and applicability of the Code are open to interpretation. The Guidance sets out "the Commission’s interpretation and application of the AI Act, on which it will base its enforcement action," and provides critical clarification on certain interpretative issues as they relate to both the Code and the AI Act.
For example:
When is an AI model a GPAI model?

The EU AI Act's definition of a GPAI model (Article 3(63)) does not set out specific criteria. The Commission will instead make an initial assessment based on: i) the amount of compute used to train the model, with a current threshold of more than 10²³ FLOPs (an increase from a prior threshold of 10²² FLOPs); and ii) the model's modalities (i.e. whether it can produce text, image, video and/or audio outputs). However, models that do not strictly meet these criteria but nonetheless display "significant generality" and competence at "performing a wide range of distinct tasks" are still likely to be considered GPAI models.
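The compute-based limb of this initial assessment is, in effect, a single numeric comparison. The sketch below is purely illustrative: the 10²³ FLOP figure is the threshold described above, while the function name and interface are hypothetical, and a result either way remains only an initial, rebuttable indication.

```python
# Illustrative sketch only: 1e23 FLOPs is the Commission's current
# indicative threshold; this helper and its name are hypothetical.
GPAI_COMPUTE_THRESHOLD_FLOPS = 1e23  # raised from the earlier 1e22 figure

def presumed_gpai(training_compute_flops: float) -> bool:
    """Initial indication that a model is a GPAI model, based solely on
    training compute (modality and generality are assessed separately)."""
    return training_compute_flops > GPAI_COMPUTE_THRESHOLD_FLOPS
```

On this sketch, a model trained with 5×10²³ FLOPs would be presumptively in scope, while a model below the threshold may still qualify if it displays significant generality.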
When is a GPAI model one with systemic risk?

Article 51(1) of the AI Act states that a GPAI model will be one with systemic risk if it meets either of the following conditions:
- the model has "high-impact capabilities" that "match or exceed those recorded in the most advanced models" (Article 3(64) AI Act); or
- the Commission (or a qualified scientific panel) decides that it has an equivalent capability or impact.

This systemic risk assessment must be carried out continuously throughout the model's entire lifecycle.
How should downstream modifiers of GPAI models be classified?

Not every modification of a GPAI model will result in the downstream modifier being considered a model 'provider' under the AI Act. Instead, downstream modifiers are classified as providers of modified GPAI models only if their modifications lead to significant changes in the original model's generality, capabilities, or systemic risk.

An indicative criterion for a significant modification is whether the training compute used for the modification exceeds one third of the training compute of the original model. If this threshold is crossed, the Commission treats the modification as the release of a new model.

The downstream modifier must then comply with the AI Act immediately, as the so-called 'grandfathering clause' – permitting GPAI models placed on the market before 2 August 2025 until 2 August 2027 to comply – no longer applies. If the original model had systemic risk, any significant modification is presumed to maintain high-impact capabilities, making the modifier a provider of a GPAI model with systemic risk.
What exemptions from the AI Act's obligations apply to those releasing open-source models?

Models released under a free and open-source licence are exempt from certain obligations – but only if the model parameters, weights, architecture, and usage information are also made publicly available.

Exempted obligations include:
- keeping technical documentation (Article 53(1)(a))
- making documentation available to integrators of the model into AI systems (Article 53(1)(b))
- appointing an authorised representative if established in a third country (Article 54)

However, these exemptions do not apply if the model is classified as a GPAI model with systemic risk. Additionally, all providers remain obliged to comply with EU copyright law and to produce training data summaries (see below).
Given the pace and scale of GPAI's development, periodic reviews of the Guidance and the Code will be crucial. The independent experts who contributed to the creation of the Code have called for them both to be regularly reviewed and benchmarked against frontier model capabilities.
The Template
From 2 August 2025, GPAI model providers are required by the AI Act (Article 53(1)(d)) to publish on their official website a summary of the content used to train their models. The Commission has now published, following an extensive consultation, its training data summary Template. Completion of the Template is mandatory and is the only means by which GPAI model providers can supply the required information.
However, developers of existing GPAI models (i.e. those placed on the EU market before 2 August 2025) have until 2 August 2027 to comply (and may assert that the relevant information is not available or would be disproportionate to provide, though they should clearly state this and provide justification).
Failure to provide the training data summary via the Template could lead to a fine of up to 3% of annual worldwide turnover, or €15 million, whichever is higher. However, the Commission will not begin enforcing the requirements until 2 August 2026.
What does the Template cover?
One of the most significant concerns of rights holders has been the lack of transparency over the sources of training data used for GPAI models. The Template is designed to address these concerns, enabling rights holders to identify the types of content used and the extent to which the conditions for lawful text and data mining (including respect for rights reservations) have been complied with. The Template also requires providers to disclose whether the model has been trained on user data, thereby facilitating the exercise of data subjects' rights. It remains notable, however, that beyond this there is limited reference to data protection, and limited cross-referencing to the GDPR, in either the Code or the Guidance. Key GDPR concepts, such as data protection impact assessments and, indeed, data protection officers, could arguably have a much more important role to play here, but are not mentioned.
Transparency over training data may also deliver wider benefits: supporting assessments of data diversity, freedom of scientific research, and competitive markets. Meanwhile, to address model providers' concerns about disclosing trade secrets, the Commission has implemented different levels of detail depending on the relevant data source.
The Template requires the following three main types of information to be provided in relation to all stages of the model training process, as a "uniform baseline":
- General information: in addition to details about the provider and the model, this includes information on the modalities (text, image, audio, video, and any other type of training data), overall training data size, and other general characteristics of the training data.
- List of data sources: providers are required to provide information about training data sources, including:
- Publicly available datasets: the obligation is to provide a list of large publicly available datasets used (i.e. where the total data size for any one of the modalities is more than 3% of the size of all publicly available datasets for that modality), and a general description of other publicly available datasets.
- Private non-publicly available datasets obtained from third parties: limited information (in recognition of trade secrets concerns) is required in relation to commercially licensed datasets and other private datasets obtained from third parties (where these are publicly known).
In practice, it may be difficult to distinguish datasets that have put commercial licences in place with all of the relevant rights holders and representatives from those that have not, but it is in theory possible. Examples could include the Swiss 'public model' and the Dutch media's national collective licensing database for AI training.
- Data crawled and scraped from online sources: this section requires a comprehensive description of the crawlers used, their purpose and behaviour, the period of collection, and a comprehensive description of the types of content and online sources crawled. It also requires a list in summarised narrative form of the top 10% of all domain names crawled (calculated by the size of the content scraped) – but not of the specific data and works themselves. Providers may however provide more information on a voluntary basis.
Here, the EU appears to be trying to strike a balance between complete disclosure, practicality, and fairness. For SMEs, the obligation is lower, at the top 5% or top 1,000 domains, and it will be interesting to see whether the 10% figure is revised over time.
- User data: confirmation as to whether user data is used to train the model and a general description of services or products used to collect user data.
- Synthetic data: this includes data created by the provider for training the model, in particular through model distillation or model alignment.
- Relevant data processing aspects:
- Information on the measures implemented before model training to respect reservation of rights before and during data collection, including compliance with opt-outs and other solutions.
- Measures taken to avoid or remove illegal content such as blacklists, key words and model-based classifiers (but without requiring disclosure of trade secrets).
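The quantitative disclosure thresholds summarised above lend themselves to a short sketch. This is purely illustrative: the 3% dataset figure and the 10% / 5% / 1,000-domain figures are as described in this article, while the function names and the exact rounding and cut-off behaviour are hypothetical assumptions.

```python
# Purely illustrative: the 3% dataset threshold and the 10% / 5% /
# 1,000-domain figures are as summarised above; function names and
# rounding behaviour are hypothetical assumptions.
def must_list_public_dataset(dataset_size: float,
                             total_public_size_for_modality: float) -> bool:
    """A publicly available dataset must be individually listed if it
    exceeds 3% of all publicly available data for that modality."""
    return dataset_size > 0.03 * total_public_size_for_modality

def domains_to_disclose(domains_by_scraped_size: list[str],
                        is_sme: bool = False) -> list[str]:
    """Domains (ranked by size of content scraped) to be summarised:
    the top 10% for most providers; for SMEs, the top 5%, capped at
    1,000 domains."""
    n = len(domains_by_scraped_size)
    count = min(n * 5 // 100, 1000) if is_sme else n * 10 // 100
    return domains_by_scraped_size[:count]
```

On these assumptions, a provider that crawled 10,000 domains would summarise the top 1,000 by scraped content size, while an SME would summarise the top 500.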
Rights holders are likely to be concerned that the level of information required by the Template will not, in practice, give them sufficient information to assess whether their works have been used as training data. The Commission recommends that providers act in good faith and, on a voluntary basis, provide relevant information to rights holders on request, but it remains to be seen how effective this voluntary 'upon request' mechanism will be. It will also be interesting to see how concerns over the adequacy of the information provided are dealt with and enforced.
Meanwhile, the UK Government will also be watching developments closely, given the prominence it has given to transparency in its consultation on GenAI and IP.