An evaluation score without evidence is a number asking you to trust its memory.

Evaluations need durable evidence

AI evaluations often start as test files, dashboards, or experiment notebooks. That is fine for exploration, but it is not enough for production governance, especially when agents call tools, retrieve context, and change behavior that touches data.

The evaluation evidence store should preserve the run, prompt, tool calls, retrieved sources, dataset versions, policy decisions, lineage, metrics, judge outputs, reviewer decisions, and release link. Without that record, the score cannot explain itself later.
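As a minimal sketch of what one stored record could look like, the fields listed above can be written down as a single typed structure. The class name `EvidenceRecord`, the field names, and the sample values are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass(frozen=True)
class EvidenceRecord:
    """One evaluation run plus what is needed to explain its score later.

    Hypothetical schema: field names mirror the list in the text above.
    """
    run_id: str
    prompt: str
    tool_calls: list = field(default_factory=list)          # one entry per tool invocation
    retrieved_sources: list = field(default_factory=list)   # identifiers of retrieved context
    dataset_versions: dict = field(default_factory=dict)    # dataset name -> version
    policy_decisions: list = field(default_factory=list)
    lineage: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)
    judge_outputs: list = field(default_factory=list)
    reviewer_decisions: list = field(default_factory=list)
    release_link: Optional[str] = None                       # set once a release cites this run


def to_json(record: EvidenceRecord) -> str:
    """Serialize to plain JSON so the record stays readable outside this code."""
    return json.dumps(asdict(record), sort_keys=True)


# Sample values are made up for illustration.
record = EvidenceRecord(
    run_id="run-001",
    prompt="Summarize the open tickets.",
    dataset_versions={"tickets-eval": "v3"},
    metrics={"accuracy": 0.92},
)
print(to_json(record))
```

Keeping the record immutable (`frozen=True`) is a design choice in this sketch: a later correction would be written as a new record rather than an edit to the old one.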

What the store should record

OpenAI evaluation guidance is useful for designing evals, while NIST AI RMF organizes risk management into four functions: Govern, Map, Measure, and Manage. The ODI contribution is to make the data evidence behind those evaluations portable and reviewable.

Core idea: evaluation evidence is a data product. Treat it like one.

Why this belongs in ODI

Store evaluation evidence in open formats where possible, tie it to catalog metadata, record lineage events, and link each run to the policy decisions that applied to it. If a model release depends on a passing eval, the release should carry the evidence, not only a green check.
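One way to make a release carry its evidence, sketched here under stated assumptions, is to embed the evidence in the release manifest together with a digest of its canonical JSON form. The function names, the `passed` flag, and the manifest layout are hypothetical; only `json` and `hashlib` from the Python standard library are used.

```python
import hashlib
import json


def evidence_digest(evidence: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_release_manifest(model_version: str, evidence: dict) -> dict:
    """Attach the evidence and its digest to the release (hypothetical layout)."""
    if not evidence.get("passed"):
        raise ValueError("release requires a passing evaluation")
    return {
        "model_version": model_version,
        "evidence": evidence,
        "evidence_sha256": evidence_digest(evidence),
    }


def verify_release_manifest(manifest: dict) -> bool:
    """Recompute the digest to detect evidence that changed after release."""
    return evidence_digest(manifest["evidence"]) == manifest["evidence_sha256"]


# Sample values are made up for illustration.
manifest = build_release_manifest(
    "model-v2",
    {"run_id": "run-001", "passed": True, "dataset_versions": {"tickets-eval": "v3"}},
)
print(verify_release_manifest(manifest))
```

The digest does not prove the evaluation was sound. It only shows that the evidence attached to the release is the same evidence that was recorded when the manifest was built.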

For related patterns, read foundation for AI control plane architecture, foundation for AI data lineage SLAs, and AI-ready context evaluation datasets.

What breaks first

  • An eval passes, but the dataset version is not stored.
  • A tool call changes behavior, but the evidence store only records the final answer.
  • A reviewer overrides a result without linking the decision to policy context.
  • A release cites an evaluation run that cannot be replayed.
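The four failure modes above can be turned into simple checks against a stored record. This is a rough sketch: the keys `passed`, `tools_enabled`, `override`, and `policy_ref`, and the choice of which inputs count as enough to replay a run, are assumptions made for illustration.

```python
# Assumed minimum inputs for replaying a run; a real store would likely need more.
REPLAY_INPUTS = ("prompt", "dataset_versions")


def find_evidence_gaps(record: dict) -> list[str]:
    """Return one human-readable gap per failure mode found in the record."""
    gaps = []

    # 1. The eval passed, but no dataset version was stored.
    if record.get("passed") and not record.get("dataset_versions"):
        gaps.append("passing eval has no dataset version")

    # 2. Tools were available, but the tool calls were never recorded.
    if record.get("tools_enabled") and "tool_calls" not in record:
        gaps.append("tool calls not recorded, only the final answer")

    # 3. A reviewer overrode a result without a link to policy context.
    for decision in record.get("reviewer_decisions", []):
        if decision.get("override") and not decision.get("policy_ref"):
            gaps.append("reviewer override lacks a policy reference")

    # 4. A release cites the run, but the inputs needed to replay it are missing.
    if record.get("release_link"):
        missing = [name for name in REPLAY_INPUTS if not record.get(name)]
        if missing:
            gaps.append("released run cannot be replayed, missing: " + ", ".join(missing))

    return gaps


# Sample record, made up for illustration, that trips all four checks.
incomplete = {
    "run_id": "run-001",
    "passed": True,
    "prompt": "Summarize the open tickets.",
    "tools_enabled": True,
    "reviewer_decisions": [{"reviewer": "reviewer-a", "override": True}],
    "release_link": "release-42",
}
for gap in find_evidence_gaps(incomplete):
    print(gap)
```

A check like this could run when a record is written or when a release is proposed, so that gaps surface before an incident review rather than during one.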

Evaluation evidence questions

Ask which dataset version was used, which sources were retrieved, which tools ran, which policy decisions applied, which scores mattered, and which reviewer accepted the result.

Sources to start with

These primary sources anchor the technical claims in this guide.

A passing eval should leave enough evidence to survive the next incident review.