Training-data provenance used to sound like a research concern. Now it is becoming an operating concern for anyone building or buying serious AI systems.

The EU AI Act creates new obligations around AI governance, documentation, and transparency. Providers of general-purpose AI models have specific duties, including making public a summary of the content used to train their models. Legal teams should interpret the details. Data teams still have to build the evidence path.

That is where ODI matters. You cannot document provenance at the end if the infrastructure never captured it at the beginning.

Provenance is more than a source list

For AI training and fine-tuning, provenance should answer practical questions. Where did the data come from? Who owned it? What rights or restrictions applied? How was it transformed? Which versions were used? Which records were excluded? Who approved the use?
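One way to make those questions concrete is to treat each one as a field on a record that travels with the dataset. The sketch below is illustrative only: the class name, field names, and example values are my assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record; each field maps to one of the questions above."""
    source_uri: str                             # Where did the data come from?
    owner: str                                  # Who owned it?
    license: str                                # What rights or restrictions applied?
    dataset_version: str                        # Which version was used?
    approved_by: Optional[str] = None           # Who approved the use?
    transformations: Tuple[str, ...] = ()       # How was it transformed?
    excluded_record_ids: Tuple[str, ...] = ()   # Which records were excluded?

    def unanswered(self) -> List[str]:
        """Names of required fields that are still empty."""
        required = ("source_uri", "owner", "license",
                    "dataset_version", "approved_by")
        return [name for name in required if not getattr(self, name)]


# Example values are made up for illustration.
record = ProvenanceRecord(
    source_uri="s3://example-bucket/raw/tickets/",
    owner="support-ops",
    license="internal-use-only",
    dataset_version="2024-06-01",
    transformations=("drop_pii_columns", "deduplicate"),
    excluded_record_ids=("ticket-1042",),
)

print(record.unanswered())  # ['approved_by']
```

The transformation and exclusion fields are allowed to be empty because "nothing was transformed" and "nothing was excluded" are valid answers. A missing owner or approver is not.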

A static spreadsheet will not hold up when datasets change, pipelines rerun, and models are refreshed. Provenance has to live in the data path through lineage, catalogs, policy records, and versioned data products.
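As a rough illustration of what "versioned" can mean in practice, here is a minimal sketch that derives a version identifier from the data itself, so a pipeline rerun that changes the records also changes the version. The function name and the choice of SHA-256 over canonical JSON are my assumptions; real table formats and catalogs have their own snapshot mechanisms.

```python
import hashlib
import json
from typing import Dict, Iterable


def dataset_fingerprint(records: Iterable[Dict]) -> str:
    """Return a hex digest that depends on record content, not record order."""
    # Serialize each record with sorted keys so equal records serialize equally.
    canonical = sorted(
        json.dumps(record, sort_keys=True, separators=(",", ":"))
        for record in records
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # delimiter so adjacent records cannot blur together
    return digest.hexdigest()


# Made-up records for illustration.
first_run = [{"id": 1, "text": "reset password"}, {"id": 2, "text": "refund"}]
rerun_same = [{"text": "refund", "id": 2}, {"id": 1, "text": "reset password"}]
rerun_changed = [{"id": 1, "text": "reset password"}]

print(dataset_fingerprint(first_run) == dataset_fingerprint(rerun_same))     # True
print(dataset_fingerprint(first_run) == dataset_fingerprint(rerun_changed))  # False
```

Recording this identifier next to each training run is what lets you later say which data a model actually saw, instead of which data you think it saw.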

Core idea: AI provenance is not paperwork after training. It is infrastructure before training.

Open infrastructure makes the record durable

Closed data paths make provenance fragile because the evidence is trapped in private logs, vendor-specific metadata, or manual review notes. Open table formats, portable lineage events, catalog metadata, and policy-as-infrastructure patterns make the record easier to inspect and preserve.
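To show why a portable event format helps, here is a small sketch that stores lineage events as plain JSON lines and walks them backward from a training input to its raw sources. The event shape (job, inputs, outputs) and all dataset names are hypothetical and do not follow any particular lineage standard.

```python
import json
from typing import Dict, List, Set

# Hypothetical lineage log: one JSON object per line, readable by any tool.
EVENTS_JSONL = """
{"job": "redact_pii", "inputs": ["raw/tickets"], "outputs": ["clean/tickets"]}
{"job": "build_training_set", "inputs": ["clean/tickets", "raw/faq"], "outputs": ["train/support-v3"]}
"""


def parse_events(text: str) -> List[Dict]:
    """Parse a JSON-lines lineage log, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def upstream_of(dataset: str, events: List[Dict]) -> Set[str]:
    """Every dataset that `dataset` was derived from, directly or indirectly."""
    # Map each output to the inputs of the jobs that produced it.
    parents: Dict[str, Set[str]] = {}
    for event in events:
        for output in event["outputs"]:
            parents.setdefault(output, set()).update(event["inputs"])

    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles in the log
                seen.add(parent)
                stack.append(parent)
    return seen


events = parse_events(EVENTS_JSONL)
print(sorted(upstream_of("train/support-v3", events)))
# ['clean/tickets', 'raw/faq', 'raw/tickets']
```

Because the log is plain text in a simple shape, the trace does not depend on the system that wrote it still being available.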

That does not make compliance automatic. It makes the facts easier to produce when compliance asks for them.

Questions to ask before data touches a model

  • Can we identify the source, owner, license, consent boundary, and policy for this data?
  • Can we prove which version of the dataset was used?
  • Can we trace transformations from raw source to training input?
  • Can we remove or exclude data and show which downstream datasets and models are affected?

If the infrastructure cannot answer those questions, the model is borrowing trust it has not earned.
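The four questions can be sketched as a check that runs before a dataset is handed to training. This is a minimal sketch under my own assumptions: the dictionary keys, the lineage graph shape, and every name and value in the example are hypothetical.

```python
from typing import Dict, List, Set


def downstream_of(dataset: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Everything derived, directly or indirectly, from `dataset`."""
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def gate(candidate: Dict, edges: Dict[str, List[str]]) -> List[str]:
    """Return the reasons this dataset should not reach training yet."""
    problems = []
    # Question 1: source, owner, license, consent boundary, and policy.
    for key in ("source", "owner", "license", "consent_boundary", "policy"):
        if not candidate.get(key):
            problems.append(f"missing {key}")
    # Question 2: a pinned dataset version.
    if not candidate.get("version"):
        problems.append("dataset version is not pinned")
    # Question 3: a recorded path from raw source to training input.
    if not candidate.get("lineage_to_raw"):
        problems.append("no lineage from raw source to training input")
    # Question 4: removal impact is only traceable if the dataset is in the graph.
    known = set(edges) | {child for children in edges.values() for child in children}
    if candidate.get("name") not in known:
        problems.append("dataset is not in the lineage graph")
    return problems


# Made-up lineage graph: parent dataset -> datasets or models built from it.
edges = {
    "raw/tickets": ["clean/tickets"],
    "clean/tickets": ["train/support-v3"],
    "train/support-v3": ["model/support-assistant"],
}
candidate = {
    "name": "train/support-v3",
    "source": "raw/tickets",
    "owner": "support-ops",
    "license": "internal-use-only",
    "consent_boundary": "",  # deliberately left empty to show a failure
    "policy": "pii-redacted",
    "version": "sha256:ab12",
    "lineage_to_raw": ["raw/tickets", "clean/tickets"],
}

print(gate(candidate, edges))  # ['missing consent_boundary']
print(sorted(downstream_of("raw/tickets", edges)))
# ['clean/tickets', 'model/support-assistant', 'train/support-v3']
```

An empty list from the gate does not mean the use is lawful. It means the facts a reviewer would need are at least present.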

Sources to start with

These are the primary sources I would start from when checking the claims in this piece.