Open Data Infrastructure
Foundation for AI Evaluation Evidence Stores
Why AI programs need evaluation evidence stores that record prompts, tool calls, datasets, policies, lineage, scores, and reviewer decisions.
An evaluation score without evidence is a number asking you to trust its memory.
Evaluations need durable evidence
AI evaluations often start as test files, dashboards, or experiment notebooks. That is fine for exploration. It is not enough for production governance, especially when agents use tools, retrieve context, and change data-facing behavior.
The evaluation evidence store should preserve the run, prompt, tool calls, retrieved sources, dataset versions, policy decisions, lineage, metrics, judge outputs, reviewer decisions, and a link to the release that relied on the result. Without that record, the score cannot explain itself later.
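One way to picture that record is as a single typed structure that serializes to JSON. The sketch below is illustrative only: the class names, field names, and sample values are assumptions made for this example, not a schema defined by any of the sources cited in this guide.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical shape for one tool invocation made during the run.
    name: str
    arguments: dict
    result_digest: str  # hash of the tool output, so large payloads can live elsewhere


@dataclass(frozen=True)
class EvidenceRecord:
    # One field per item the store is expected to preserve (names are assumptions).
    run_id: str
    prompt: str
    tool_calls: tuple
    retrieved_sources: tuple
    dataset_versions: dict
    policy_decisions: tuple
    lineage_run_id: str
    metrics: dict
    judge_outputs: tuple
    reviewer_decisions: tuple
    release_ref: str


# Sample values are invented for illustration.
record = EvidenceRecord(
    run_id="run-001",
    prompt="Summarize the refund policy for this ticket.",
    tool_calls=(ToolCall("search_kb", {"query": "refund policy"}, "sha256:example"),),
    retrieved_sources=("kb/refunds.md",),
    dataset_versions={"support-tickets": "v3"},
    policy_decisions=({"policy": "pii-redaction", "outcome": "allow"},),
    lineage_run_id="lineage-run-001",
    metrics={"accuracy": 0.92},
    judge_outputs=({"judge": "rubric-a", "verdict": "pass"},),
    reviewer_decisions=({"reviewer": "alice", "action": "accept"},),
    release_ref="release-example",
)

# asdict() recurses into nested dataclasses, so the whole record becomes plain JSON.
serialized = json.dumps(asdict(record), sort_keys=True)
```

Keeping every field mandatory is a deliberate choice in this sketch: a run that cannot name its dataset version or tool calls fails at construction time rather than at incident review.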
What the store should record
OpenAI's evaluation guidance is useful for designing evals, while the NIST AI Risk Management Framework (AI RMF) organizes risk management around four functions: Govern, Map, Measure, and Manage. The Open Data Infrastructure (ODI) contribution is to make the data evidence behind those evaluations portable and reviewable.
Core idea: evaluation evidence is a data product. Treat it like one.
Why this belongs in ODI
Store evaluation evidence in open formats where possible, tie it to catalog metadata, record lineage events, and link each policy decision to the run it governed. If a model release depends on a passing eval, the release should carry the evidence, not only a green check.
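A minimal sketch of "the release carries the evidence" follows, assuming JSON Lines as the open format and a SHA-256 content digest as the link between a release and its evidence. The function names, the manifest layout, and the sample identifiers are hypothetical; a real store would likely add catalog metadata and lineage events on top.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def canonical_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes for the same content.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_evidence(store_path: Path, record: dict) -> str:
    # Append one record as a JSON Lines row and return its content digest.
    payload = canonical_bytes(record)
    with store_path.open("ab") as handle:
        handle.write(payload + b"\n")
    return hashlib.sha256(payload).hexdigest()


def build_release_manifest(release_id: str, evidence_digests: list) -> dict:
    # A release without evidence is rejected instead of shipping a bare green check.
    if not evidence_digests:
        raise ValueError("a release must cite at least one evidence record")
    return {"release_id": release_id, "evidence_sha256": sorted(evidence_digests)}


def verify_release(store_path: Path, manifest: dict) -> bool:
    # Recompute digests from the store and confirm every cited record is present.
    stored = set()
    with store_path.open("rb") as handle:
        for line in handle:
            stored.add(hashlib.sha256(line.rstrip(b"\n")).hexdigest())
    return all(digest in stored for digest in manifest["evidence_sha256"])


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "evidence.jsonl"
    digest = append_evidence(
        store,
        {"run_id": "run-001", "dataset_versions": {"support-tickets": "v3"}, "passed": True},
    )
    manifest = build_release_manifest("release-example", [digest])
    verified = verify_release(store, manifest)
```

Because the digest is computed from the record's content, editing a stored record after the release changes its digest and makes verification fail, which is the property a reviewer needs.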
For related patterns, read foundation for AI control plane architecture, foundation for AI data lineage SLAs, and AI-ready context evaluation datasets.
What breaks first
- An eval passes, but the dataset version is not stored.
- A tool call changes behavior, but the evidence store only records the final answer.
- A reviewer overrides a result without linking the decision to policy context.
- A release cites an evaluation run that cannot be replayed.
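The four failure modes above can be turned into a check that runs before a result is accepted. This is a sketch under stated assumptions: the record is a plain dictionary, the key names (`dataset_versions`, `tool_calls`, `reviewer_decisions`, `policy_ref`, `model_id`) are hypothetical, and the set of fields needed for replay is an assumed minimum rather than an established requirement.

```python
# Assumed minimum inputs needed to replay a run; adjust to the actual system.
REPLAY_FIELDS = ("prompt", "model_id", "dataset_versions")


def find_evidence_gaps(record: dict) -> list[str]:
    """Return a human-readable list of missing evidence for one evaluation run."""
    gaps = []

    # An eval passes, but the dataset version is not stored.
    if not record.get("dataset_versions"):
        gaps.append("dataset version not stored")

    # Only the final answer was kept. An empty list is fine: it records
    # that no tools ran, which is different from not recording at all.
    if "tool_calls" not in record:
        gaps.append("tool calls not recorded")

    # A reviewer override must point at the policy context that justified it.
    for decision in record.get("reviewer_decisions", []):
        if decision.get("action") == "override" and not decision.get("policy_ref"):
            reviewer = decision.get("reviewer", "unknown")
            gaps.append(f"override by {reviewer} lacks policy context")

    # A cited run must keep enough inputs to be replayed.
    missing = [name for name in REPLAY_FIELDS if not record.get(name)]
    if missing:
        gaps.append("run cannot be replayed, missing: " + ", ".join(missing))

    return gaps


incomplete = {
    "final_answer": "Refund approved",
    "reviewer_decisions": [{"reviewer": "alice", "action": "override"}],
}
complete = {
    "prompt": "Summarize the refund policy for this ticket.",
    "model_id": "model-a",
    "dataset_versions": {"support-tickets": "v3"},
    "tool_calls": [],
    "reviewer_decisions": [
        {"reviewer": "alice", "action": "override", "policy_ref": "policy-7"}
    ],
}

incomplete_gaps = find_evidence_gaps(incomplete)
complete_gaps = find_evidence_gaps(complete)
```

Returning a list of gaps instead of a single boolean keeps the check itself reviewable: the release record can store exactly which evidence was missing when a run was rejected.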
Evaluation evidence questions
Ask which dataset version was used, which sources were retrieved, which tools ran, which policy decisions applied, which scores mattered, and which reviewer accepted the result.
Sources to start with
These primary sources anchor the technical claims in this guide.
- OpenAI evals documentation
- NIST AI Risk Management Framework
- OpenLineage object model documentation
- W3C PROV-O recommendation
A passing eval should leave enough evidence to survive the next incident review.