AI training data documentation and disclosure

ai-training-data-disclosure
Domain: ai-transparency
Type: policy

Description

Training-data disclosure obligations represent a particularly Levine-ian regulatory move: for the most part, the statutes do not regulate what an AI model may be trained on; they regulate whether the public gets to know what it was trained on. California AB 2013 requires the documentation; the EU AI Act's general-purpose AI provisions require the documentation; China's algorithm registry filing requires (in effect) the documentation. The premise is that the market and the regulator can do the substantive work of evaluating training-data practices once visibility exists, but the visibility has to be built first. The result is an obligation that looks like a paperwork exercise from the outside and turns out, on contact with a modern training pipeline, to be a genuinely hard data-engineering problem.

A working training-data documentation program has five pieces. The source inventory comes first: every dataset that touched the training run, with its acquisition path (licensed, scraped, user-generated, synthetic, purchased through a data broker), its date range, and its license terms. The category map comes second: protected attributes present in the data, sensitive categories such as health, biometric, and minors-derived signals, and the breakdown by content type. The scraping-disclosure summary comes third, covering any portion of the corpus derived from web scraping, with attention to whether the source sites had machine-readable opt-out signals (robots.txt, ai.txt, X-Robots-Tag) at the time of collection. The protected-attribute handling policy comes fourth, describing how attributes like race, gender, disability, or age were collected, inferred, or excluded from the training signal. The publication surface comes fifth, because the documentation only satisfies the regulation when the public can read it: a data card on the model release page, a per-region addendum where local law requires, and an archived version-controlled record so a regulator can retrieve the documentation as it existed at training time.

The thresholds are uneven across the three regimes. California AB 2013 applies to generative AI systems made available to Californians and takes effect 2026-01-01; it requires the documentation to be posted on the developer's website before the system is made available. The EU AI Act Article 53 obligations on general-purpose AI model providers, with additional Article 55 obligations above the systemic-risk threshold of 10^25 cumulative training FLOPs, have applied since 2025-08-02, with the Commission's enforcement powers over those providers beginning 2026-08-02. China's algorithm registry filing under the Internet Information Service Algorithmic Recommendation Management Provisions has been in force since 2022-03-01 and operates as a prior-notification regime rather than a public-disclosure regime; the documentation lives with the regulator rather than the public. The substantive content of the documentation overlaps meaningfully across all three, which makes a single master data card the natural authoring surface, with jurisdiction-specific addenda layered on top.

The failure mode worth naming is that training-data documentation is usually written by the ML team after the fact, against a pipeline that was never designed for retrospective documentation. Bolting documentation onto a pipeline that was not instrumented to record provenance costs dramatically more than building the instrumentation in from the start. Operators who treat AB 2013 as a documentation problem rather than a pipeline-instrumentation problem tend to discover that the actual gap is data-engineering capacity, not legal review. Where the documentation reveals practices that the operator would prefer not to disclose, the regulatory pressure tends to shift the underlying practice rather than the documentation: the disclosure is the lever, not the artifact.
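As an illustration of what "building the instrumentation in" might look like for the source inventory, the following is a minimal Python sketch, not a prescribed schema. The class names, field names, and example datasets are all hypothetical; only the acquisition paths and the opt-out signal names come from the description above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional, Tuple


class AcquisitionPath(Enum):
    """Acquisition paths named in the description (labels are illustrative)."""
    LICENSED = "licensed"
    SCRAPED = "scraped"
    USER_GENERATED = "user-generated"
    SYNTHETIC = "synthetic"
    DATA_BROKER = "data-broker"


@dataclass(frozen=True)
class SourceRecord:
    """One row of a hypothetical source inventory."""
    name: str
    path: AcquisitionPath
    collected_from: date
    collected_to: date
    license_terms: str
    # For scraped sources: which opt-out signals were checked at collection
    # time, e.g. ("robots.txt", "X-Robots-Tag"). None means nothing was logged.
    opt_out_signals_checked: Optional[Tuple[str, ...]] = None


def count_by_path(inventory):
    """Number of datasets per acquisition path."""
    return dict(Counter(record.path.value for record in inventory))


def scraped_without_opt_out_record(inventory):
    """Scraped datasets with no record of opt-out signal handling."""
    return [
        record.name
        for record in inventory
        if record.path is AcquisitionPath.SCRAPED
        and record.opt_out_signals_checked is None
    ]


# Hypothetical example data, for illustration only.
inventory = [
    SourceRecord("licensed-news-archive", AcquisitionPath.LICENSED,
                 date(2015, 1, 1), date(2023, 12, 31), "commercial license"),
    SourceRecord("forum-crawl", AcquisitionPath.SCRAPED,
                 date(2022, 6, 1), date(2023, 6, 1), "site terms of service",
                 ("robots.txt", "X-Robots-Tag")),
    SourceRecord("blog-crawl", AcquisitionPath.SCRAPED,
                 date(2021, 1, 1), date(2021, 12, 31), "unknown"),
]

print(count_by_path(inventory))
print(scraped_without_opt_out_record(inventory))
```

Recording the opt-out check as a field that is written at collection time, rather than reconstructed later, is the design choice the failure-mode paragraph argues for: the second function surfaces exactly the datasets where a retrospective scraping-disclosure summary would have nothing to draw on.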

Applicability

Applies when: AI role is ai-provider.

Required by (4 regulations)

California AB 2013
AB 2013 requires developers of generative AI systems made available to Californians to publish, on their website, documentation describing the datasets used to train the system. Required elements include the sources or owners of the datasets, a description of how the datasets further the intended purpose, the number of data points, a description of types of data, whether the datasets include personal information or aggregate consumer information, whether the datasets were purchased or licensed, and the time period during which the data was collected.
California AB 2013 (generative AI training data transparency); effective 2026-01-01
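The required elements listed above lend themselves to a simple completeness check before publication. The sketch below is one possible shape for such a check, assuming a per-dataset dictionary; the key names are hypothetical, and the descriptions paraphrase the element list in this entry rather than the statutory text.

```python
# Hypothetical keys mapped to the elements summarized in this entry.
AB2013_ELEMENTS = {
    "sources_or_owners": "sources or owners of the datasets",
    "purpose_description": "how the datasets further the intended purpose",
    "num_data_points": "number of data points",
    "data_types": "description of types of data",
    "personal_or_aggregate_consumer_info":
        "whether the datasets include personal or aggregate consumer information",
    "purchased_or_licensed": "whether the datasets were purchased or licensed",
    "collection_period": "time period during which the data was collected",
}


def _is_blank(value):
    """True for None, whitespace-only strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False  # booleans and numbers count as answered, including False and 0


def missing_elements(entry):
    """Return the keys of elements that the entry leaves unanswered."""
    return [key for key in AB2013_ELEMENTS if _is_blank(entry.get(key))]


# Hypothetical draft entry with two gaps.
draft = {
    "sources_or_owners": ["Example Licensor Inc."],
    "purpose_description": "  ",
    "num_data_points": 1200000,
    "data_types": "text",
    "personal_or_aggregate_consumer_info": False,
    "purchased_or_licensed": True,
}

for key in missing_elements(draft):
    print("missing:", AB2013_ELEMENTS[key])
```

A check like this only confirms that each element has an answer; whether the answer is adequate remains a legal-review question.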
EU AI Act
Article 53 obliges providers of general-purpose AI models to draw up and keep up to date technical documentation of the model, make information and documentation available to downstream providers integrating the model, put in place a policy to comply with EU copyright law, and publish a sufficiently detailed summary of the content used to train the model per the template provided by the AI Office. Models above the systemic-risk threshold (10^25 cumulative training FLOPs) carry additional obligations under Article 55.
Regulation (EU) 2024/1689 of the European Parliament and of the Council (Artificial Intelligence Act); Articles 53 and 55 general-purpose AI obligations applicable from 2025-08-02, with Commission enforcement powers from 2026-08-02
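For a rough sense of where a model sits relative to the 10^25 FLOP threshold, the arithmetic can be sketched as below. The estimate uses the common rule of thumb of about 6 FLOPs per parameter per training token for dense transformer models; that heuristic is an assumption introduced here for illustration, not a method prescribed by the Act, and the model sizes in the example are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold cited in this entry


def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token.

    This is an approximation for dense transformers, not a legal test.
    """
    return 6.0 * n_params * n_tokens


def exceeds_threshold(cumulative_flops):
    """True when cumulative training compute is above the threshold."""
    return cumulative_flops > SYSTEMIC_RISK_FLOPS


# Hypothetical models: (name, parameters, training tokens).
examples = [
    ("model-a", 70e9, 15e12),   # 6 * 70e9 * 15e12, about 6.3e24
    ("model-b", 400e9, 15e12),  # 6 * 400e9 * 15e12, about 3.6e25
]

for name, n_params, n_tokens in examples:
    flops = estimate_training_flops(n_params, n_tokens)
    print(name, f"{flops:.2e}", exceeds_threshold(flops))
```

Because the threshold is stated in cumulative terms, compute from every training stage that counts toward the total would need to be summed before the comparison; which stages count is a question for the regulation and its guidance, not for this sketch.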
California SB 53
SB 53 imposes transparency, safety-framework publication, and critical-safety-incident reporting obligations on frontier AI developers above the compute-threshold floor. Training-data documentation feeds into the safety framework and the published model specification.
California SB 53 (Transparency in Frontier Artificial Intelligence Act); effective 2026-01-01 for covered developers above the training-compute threshold
Algorithm Provisions
The Internet Information Service Algorithmic Recommendation Management Provisions require operators with public-opinion or social-mobilization capacity to file an algorithm registry submission with the Cyberspace Administration of China within ten working days of providing the service. The filing includes information on the algorithm's data sources, training corpus characteristics, and intended use; the documentation is held by the regulator rather than published publicly.
Provisions on the Management of Algorithmic Recommendations in Internet Information Services (jointly issued by CAC, MIIT, MPS, and SAMR; effective 2022-03-01)

Evidence formats

training-data documentation flow or master data card
source-licensing inventory with acquisition path and license terms per dataset
protected-attribute handling policy
web-scraping disclosure summary including opt-out signal handling
per-jurisdiction addenda (California AB 2013 surface, EU GPAI summary, China registry filing)
version-controlled archive of documentation as it existed at training time
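One way to make the archived record verifiable is to store a content hash alongside each snapshot of the data card. The following is a minimal sketch assuming the card is held as a JSON-serializable dictionary; the function names, snapshot layout, and example card are hypothetical, and a real program would more likely rely on the version-control system's own history.

```python
import hashlib
import json


def canonical_bytes(card):
    """Serialize the card deterministically so equal content hashes equally."""
    text = json.dumps(card, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def snapshot(card, trained_on):
    """Freeze a copy of the card with its SHA-256 digest and training date."""
    payload = canonical_bytes(card)
    return {
        "trained_on": trained_on,
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Round-trip through JSON so later edits to `card` do not leak in.
        "card": json.loads(payload.decode("utf-8")),
    }


def verify(snap):
    """Check that the archived card still matches its recorded digest."""
    digest = hashlib.sha256(canonical_bytes(snap["card"])).hexdigest()
    return digest == snap["sha256"]


# Hypothetical data card, for illustration only.
card = {"model": "example-model", "sources": ["licensed-news-archive"]}
archived = snapshot(card, trained_on="2025-03-01")
print(archived["sha256"][:12], verify(archived))
```

Sorting keys before hashing means two cards with the same content but different key order produce the same digest, so a mismatch indicates a real change to the archived documentation rather than a serialization difference.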

Magist provides legal information based on publicly available regulatory sources. It does not constitute legal advice and does not create an attorney-client relationship. Consult a licensed attorney in your jurisdiction before making compliance decisions.
