Open Data Infrastructure
Synthetic Data and Open Data Infrastructure
Synthetic data still needs provenance, lineage, quality controls, and policy. Open Data Infrastructure (ODI) makes synthetic datasets auditable instead of magical.
Synthetic data does not become safe just because the rows are fake.
Fake rows can create real risk
Synthetic data is useful. It can support testing, development, model training, privacy-preserving workflows, and data sharing where raw production data would be inappropriate.
It can also create a false sense of safety. If nobody knows how the data was generated, what source population shaped it, which sensitive attributes were preserved, which biases were amplified, or which policies apply, the label "synthetic" says little about whether the data is safe to use.
Core idea: synthetic data is a data product with a generation process. Govern the process, not only the output.
Provenance is not optional
Synthetic datasets need provenance. Teams should know which source data was used, which generator created the output, which parameters were applied, which privacy checks ran, and which quality tests passed or failed.
That provenance should travel with the dataset. Otherwise downstream teams may treat synthetic data as production-like when it is only shape-like, matching the structure of production data without necessarily matching its behavior. The difference matters for training, testing, and decision support.
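One way to make provenance travel with the dataset is a small record serialized next to the output. The sketch below is illustrative only: `ProvenanceRecord`, its field names, and the example values are assumptions, not an established schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance record for one synthetic-data generation run."""
    source_datasets: Tuple[str, ...]   # identifiers of the source data that shaped the output
    generator: str                     # generator name and version
    parameters: Dict[str, object]      # generation parameters as applied
    privacy_checks: Dict[str, str]     # check name -> "passed" or "failed"
    quality_tests: Dict[str, str]      # test name -> "passed" or "failed"
    intended_use: str                  # the purpose this dataset was generated for

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so the fingerprint is repeatable.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # SHA-256 of the serialized record; changes whenever any field changes.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# All values below are made up for illustration.
record = ProvenanceRecord(
    source_datasets=("crm.customers@v12",),
    generator="example-generator 0.3.1",
    parameters={"rows": 10000, "seed": 42},
    privacy_checks={"leakage_review": "passed"},
    quality_tests={"referential_integrity": "passed", "distribution_check": "failed"},
    intended_use="ui-testing",
)

sidecar = record.to_json()  # store this alongside the dataset files
failed = sorted(name for name, status in record.quality_tests.items() if status == "failed")
```

Recording failed tests as well as passed ones is deliberate: a downstream team deciding whether the data fits its use case needs to see both.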
Open tables make synthetic data auditable
Open table formats are useful for synthetic data because they preserve versions, snapshots, schema evolution, and repeatable access paths. A synthetic dataset should be reproducible enough that a team can answer which version trained a model or supported a test.
ODI adds the surrounding controls: catalog metadata, owners, access rules, retention, lineage, and quality signals. The generated data is only one part of the contract.
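The question "which version trained a model or supported a test" can be answered with a simple lineage log keyed by snapshot identifier. This is a minimal in-memory sketch; `LineageLog`, the table name, and the snapshot and consumer identifiers are hypothetical, and a real deployment would take snapshot identifiers from whatever its table format exposes and keep the log in the catalog.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LineageEdge:
    dataset: str       # table name of the synthetic dataset
    snapshot_id: str   # snapshot or version identifier of that table
    consumer: str      # model or test that read the snapshot


class LineageLog:
    """Hypothetical in-memory log of which consumer read which snapshot."""

    def __init__(self) -> None:
        self._edges: List[LineageEdge] = []

    def record(self, dataset: str, snapshot_id: str, consumer: str) -> None:
        self._edges.append(LineageEdge(dataset, snapshot_id, consumer))

    def snapshots_used_by(self, consumer: str) -> List[Tuple[str, str]]:
        # Which dataset versions fed this model or test?
        return sorted({(e.dataset, e.snapshot_id)
                       for e in self._edges if e.consumer == consumer})

    def consumers_of(self, dataset: str, snapshot_id: str) -> List[str]:
        # Which models or tests depend on this dataset version?
        return sorted({e.consumer for e in self._edges
                       if e.dataset == dataset and e.snapshot_id == snapshot_id})


log = LineageLog()
log.record("synthetic.customers", "snap-001", "model:churn-v3")
log.record("synthetic.customers", "snap-002", "test:checkout-ui")
log.record("synthetic.customers", "snap-002", "model:churn-v4")

used = log.snapshots_used_by("model:churn-v3")
dependents = log.consumers_of("synthetic.customers", "snap-002")
```

The second lookup is the one that matters when a snapshot turns out to be flawed: it lists everything that needs to be retrained or retested.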
Quality tests need to match the use case
A synthetic dataset for UI testing needs different proof than a synthetic dataset for model training. The first may need referential integrity, realistic edge cases, and safe fake values. The second needs distribution checks, leakage review, bias testing, and clear limits on where the data can be used.
Do not let one synthetic-data label cover every use case. Name the purpose and the risk class.
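Naming the purpose makes it possible to check a dataset against the evidence that purpose requires. The sketch below encodes the two use cases described above; the purpose names, check names, and the `missing_checks` helper are illustrative assumptions, not a standard taxonomy.

```python
from typing import Dict, FrozenSet, Iterable, List

# Required evidence per declared purpose, taken from the two examples in the text.
REQUIRED_CHECKS: Dict[str, FrozenSet[str]] = {
    "ui-testing": frozenset({
        "referential_integrity", "edge_cases", "safe_fake_values",
    }),
    "model-training": frozenset({
        "distribution", "leakage", "bias", "usage_limits",
    }),
}


def missing_checks(purpose: str, passed_checks: Iterable[str]) -> List[str]:
    """Return the required checks for this purpose that have not passed."""
    if purpose not in REQUIRED_CHECKS:
        # An unnamed or unknown purpose is rejected rather than waved through.
        raise ValueError(f"unknown purpose: {purpose!r}")
    return sorted(REQUIRED_CHECKS[purpose] - set(passed_checks))


passed = ["referential_integrity", "edge_cases", "safe_fake_values"]
gaps_for_ui = missing_checks("ui-testing", passed)
gaps_for_training = missing_checks("model-training", passed)
```

The same dataset clears the UI-testing bar and fails the model-training bar, which is the point: the label "synthetic" alone does not decide fitness for use.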
A synthetic data governance checklist
- Record source datasets, generator version, parameters, and run time.
- Document intended use and prohibited use.
- Run privacy, leakage, bias, and distribution checks appropriate to the use case.
- Version the output and keep lineage to downstream models or tests.
- Expire synthetic data when the source contract or policy changes.
Synthetic data can reduce risk. It cannot remove the need to know what the data is and where it came from.
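The last checklist item, expiring data when the source contract or policy changes, can be sketched as a comparison between the versions stamped on the dataset at generation time and the versions currently in force. `GovernanceStamp`, `is_expired`, and the version strings are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceStamp:
    """Hypothetical stamp recorded when the synthetic dataset was generated."""
    source_contract_version: str   # version of the source data contract at generation time
    policy_version: str            # version of the governing policy at generation time


def is_expired(stamp: GovernanceStamp,
               current_contract_version: str,
               current_policy_version: str) -> bool:
    # The dataset expires as soon as either the contract or the policy has moved on.
    return (stamp.source_contract_version != current_contract_version
            or stamp.policy_version != current_policy_version)


stamp = GovernanceStamp(source_contract_version="2024-03", policy_version="p7")

still_valid = not is_expired(stamp, "2024-03", "p7")
expired_after_policy_change = is_expired(stamp, "2024-03", "p8")
```

A check like this would typically run before a dataset is served, so an expired dataset is blocked rather than discovered later in an audit.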
Sources to start with
Use AI governance and provenance sources to frame synthetic data as a managed data product, not a shortcut around governance.