Synthetic Data and Open Data Infrastructure

Synthetic data does not become safe just because the rows are fake.

Fake rows can create real risk

Synthetic data is useful. It can support testing, development, model training, privacy-preserving workflows, and data sharing where raw production data would be inappropriate.

It can also create a false sense of safety. If nobody knows how the data was generated, what source population shaped it, which sensitive attributes were preserved, which biases were amplified, or which policies apply, the "synthetic" label says little about whether the data is safe to use.

Core idea: synthetic data is a data product with a generation process. Govern the process, not only the output.

Provenance is not optional

Synthetic datasets need provenance. Teams should know which source data was used, which generator created the output, which parameters were applied, which privacy checks ran, and which quality tests passed or failed.

That provenance should travel with the dataset. Otherwise downstream teams may treat synthetic data as production-like when it only resembles production data in shape, such as schema and value formats. The difference matters for training, testing, and decision support.
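One way to make provenance travel with the dataset is a small record stored alongside the output. The sketch below is illustrative only: the class `SyntheticProvenance`, its field names, and the example values (`customers_synth`, `example-generator`, and so on) are assumptions made for this example, not part of any standard or tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SyntheticProvenance:
    """Hypothetical provenance record; all field names are illustrative."""

    dataset_name: str
    source_datasets: tuple[str, ...]
    generator: str
    generator_version: str
    parameters: dict
    privacy_checks: dict  # check name -> "passed" or "failed"
    quality_tests: dict   # test name -> "passed" or "failed"

    def to_json(self) -> str:
        # sort_keys keeps the serialized record stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def failed_checks(self) -> list[str]:
        # Merge both result maps and report anything that did not pass.
        results = {**self.privacy_checks, **self.quality_tests}
        return sorted(name for name, status in results.items() if status != "passed")


record = SyntheticProvenance(
    dataset_name="customers_synth",
    source_datasets=("crm.customers",),
    generator="example-generator",
    generator_version="1.4.0",
    parameters={"seed": 42, "rows": 10000},
    privacy_checks={"exact_match_leakage": "passed"},
    quality_tests={"referential_integrity": "passed", "distribution_drift": "failed"},
)

print(record.failed_checks())
print(record.to_json())
```

Writing the JSON next to the data files, or into catalog metadata, lets a downstream team see failed checks before deciding how far to trust the dataset.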

Open tables make synthetic data auditable

Open table formats are useful for synthetic data because they preserve versions, snapshots, schema evolution, and repeatable access paths. A synthetic dataset should be reproducible enough that a team can answer which version trained a model or supported a test.

Open Data Infrastructure (ODI) adds the surrounding controls: catalog metadata, owners, access rules, retention, lineage, and quality signals. The generated data is only one part of the contract.
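The question "which version trained a model or supported a test" comes down to pinning each consumer to a snapshot. The sketch below uses an in-memory log to show the idea. `DatasetPin` and `LineageLog` are hypothetical names, and the integer snapshot IDs stand in for whatever identifier the table format actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetPin:
    """A table name plus the snapshot identifier that was read."""

    table: str
    snapshot_id: int


class LineageLog:
    """In-memory stand-in for catalog lineage (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self._pins: dict[str, DatasetPin] = {}

    def record(self, consumer: str, pin: DatasetPin) -> None:
        # A consumer (model or test run) is pinned once. Re-pinning it to a
        # different snapshot would make the lineage ambiguous, so reject it.
        existing = self._pins.get(consumer)
        if existing is not None and existing != pin:
            raise ValueError(f"{consumer} is already pinned to {existing}")
        self._pins[consumer] = pin

    def version_used_by(self, consumer: str) -> DatasetPin:
        return self._pins[consumer]

    def consumers_of(self, pin: DatasetPin) -> list[str]:
        return sorted(name for name, used in self._pins.items() if used == pin)


log = LineageLog()
log.record("churn-model-v3", DatasetPin("synth.customers", 1001))
log.record("checkout-ui-test-run-88", DatasetPin("synth.customers", 1002))

print(log.version_used_by("churn-model-v3"))
print(log.consumers_of(DatasetPin("synth.customers", 1002)))
```

In practice the pin would be written to the catalog rather than held in memory, but the contract is the same: one consumer, one immutable snapshot.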

Quality tests need to match the use case

A synthetic dataset for UI testing needs different proof than a synthetic dataset for model training. The first may need referential integrity, realistic edge cases, and safe fake values. The second needs distribution checks, leakage review, bias testing, and clear limits on where the data can be used.
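As one concrete example of a leakage check, the sketch below counts synthetic rows that exactly reproduce a source row on a chosen set of columns. This is only the crudest form of leakage review: it will not catch near-duplicates or rare-value combinations. The function name `exact_match_leakage` and the sample rows are assumptions made for illustration.

```python
def exact_match_leakage(source_rows, synthetic_rows, columns):
    """Return (leak_rate, leaked_rows) for exact matches on `columns`."""
    # Build a set of source value tuples for constant-time lookup.
    source_keys = {tuple(row[c] for c in columns) for row in source_rows}
    leaked = [
        row for row in synthetic_rows
        if tuple(row[c] for c in columns) in source_keys
    ]
    # Guard against an empty synthetic dataset.
    rate = len(leaked) / len(synthetic_rows) if synthetic_rows else 0.0
    return rate, leaked


source = [
    {"name": "Ana", "zip": "10001"},
    {"name": "Ben", "zip": "94105"},
]
synthetic = [
    {"name": "Ana", "zip": "10001"},  # copies a source row exactly
    {"name": "Cy", "zip": "60601"},
    {"name": "Dee", "zip": "94105"},  # shares a zip code only, not a full match
    {"name": "Eli", "zip": "73301"},
]

rate, leaked = exact_match_leakage(source, synthetic, columns=["name", "zip"])
print(rate, leaked)
```

A dataset meant for UI testing might only need this check plus referential integrity. A dataset meant for model training would need it as one input among several.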

Do not let one synthetic-data label cover every use case. Name the purpose and the risk class.
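Naming the purpose and the risk class can be as simple as a typed label that determines which checks are required. In the sketch below, the enum values, the check names, and the mapping in `REQUIRED_CHECKS` are assumptions drawn from the two use cases described above, not an established taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    UI_TESTING = "ui_testing"
    MODEL_TRAINING = "model_training"


class RiskClass(Enum):
    LOW = "low"
    HIGH = "high"


# Illustrative mapping from purpose to the proof that purpose needs.
REQUIRED_CHECKS = {
    Purpose.UI_TESTING: frozenset(
        {"referential_integrity", "edge_cases", "safe_fake_values"}
    ),
    Purpose.MODEL_TRAINING: frozenset(
        {"distribution", "leakage", "bias", "usage_limits"}
    ),
}


@dataclass(frozen=True)
class SyntheticLabel:
    purpose: Purpose
    # Carried with the label so access rules can key off it; not used below.
    risk_class: RiskClass

    def missing_checks(self, passed: set[str]) -> list[str]:
        # Checks that passed for some other purpose do not count here.
        return sorted(REQUIRED_CHECKS[self.purpose] - passed)


label = SyntheticLabel(Purpose.MODEL_TRAINING, RiskClass.HIGH)
gaps = label.missing_checks({"distribution", "referential_integrity"})
print(gaps)
```

The point of the design is that a dataset cleared for UI testing does not silently inherit clearance for training.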

A synthetic data governance checklist

Record source datasets, generator version, parameters, and run time.
Document intended use and prohibited use.
Run privacy, leakage, bias, and distribution checks appropriate to the use case.
Version the output and keep lineage to downstream models or tests.
Expire synthetic data when the source contract or policy changes.
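The checklist can also be expressed as a release gate. The sketch below covers the items that can be verified at release time. `SyntheticRelease`, `checklist_gaps`, the field names, and the policy version strings are all hypothetical and would need to be adapted to a real catalog.

```python
from dataclasses import dataclass


@dataclass
class SyntheticRelease:
    """Hypothetical release metadata; field names are illustrative."""

    source_datasets: list[str]
    generator_version: str
    parameters: dict
    run_time: str  # e.g. an ISO 8601 timestamp of the generation run
    intended_use: str
    prohibited_use: str
    required_checks: set[str]
    passed_checks: set[str]
    output_version: str
    source_policy_version: str  # policy in force when the data was generated


def checklist_gaps(release: SyntheticRelease, current_policy_version: str) -> list[str]:
    """Return the checklist items this release does not satisfy."""
    gaps = []
    if not (release.source_datasets and release.generator_version
            and release.parameters is not None and release.run_time):
        gaps.append("provenance incomplete")
    if not (release.intended_use and release.prohibited_use):
        gaps.append("intended or prohibited use undocumented")
    missing = sorted(release.required_checks - release.passed_checks)
    if missing:
        gaps.append("checks not passed: " + ", ".join(missing))
    if not release.output_version:
        gaps.append("output not versioned")
    # Expire the release when the source policy has moved on.
    if release.source_policy_version != current_policy_version:
        gaps.append("expired: source policy changed")
    return gaps


release = SyntheticRelease(
    source_datasets=["crm.customers"],
    generator_version="1.4.0",
    parameters={"seed": 42},
    run_time="2024-01-15T09:30:00Z",
    intended_use="UI testing",
    prohibited_use="",
    required_checks={"referential_integrity", "safe_fake_values"},
    passed_checks={"referential_integrity"},
    output_version="v7",
    source_policy_version="policy-2",
)

gaps = checklist_gaps(release, current_policy_version="policy-3")
print(gaps)
```

An empty result means the release satisfies the items checked here. Downstream lineage still has to be recorded as models and tests start consuming the data.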

Synthetic data can reduce risk. It cannot remove the need to know what the data is and where it came from.

Sources to start with

Use AI governance and provenance sources to frame synthetic data as a managed data product, not a shortcut around governance.

