Agent evaluation gets hard to reproduce when the test data exists only inside production infrastructure.

The practical problem

AI teams need evaluation sets that can be inspected, versioned, and replayed. Too often those fixtures are buried in an application database, a vector index, or a notebook that only works on one machine.

DuckDB is useful here because it can query local CSV, JSON, Parquet, and other files with SQL. That makes it a practical harness for checking retrieval fixtures, expected answers, source metadata, and data quality rules before an agent touches production systems.

Local does not mean toy

A good evaluation harness should make the small loop cheap. Can the agent retrieve the right customer record? Can it reject stale context? Can it explain which source produced the answer? Can the test fixture show the expected row without a cloud warehouse round trip?

DuckDB fits that loop because the test can travel with the repository. A fixture directory can include Parquet files, JSON tool outputs, expected answer tables, and SQL checks. The same harness can run in CI, on a laptop, or inside a controlled build step.

Core idea: agent evals need portable data fixtures, not just prompt examples.

The Open Data Infrastructure (ODI) pattern is fixture plus evidence

Treat each evaluation set like a small data product. Name the source, grain, owner, freshness rule, policy status, and expected failure behavior. Then use SQL to make the contract executable.

This pairs naturally with AI-ready evaluation sets. The model output is only one artifact. The data behind the output needs its own evidence trail, especially when retrieval, tool calls, and policy checks are part of the answer.

What breaks first

  • Evaluation examples are hand-written prompts with no source data behind them.
  • The vector index is tested, but the structured records that should govern the answer are not.
  • Fixtures contain realistic data but no lineage, freshness, or policy fields.
  • Local tests pass because they ignore constraints that production enforces.

Questions to ask

  • Can a developer rerun the evaluation set without production credentials?
  • Can the harness test retrieval, structured data, and expected denial behavior?
  • Which files are source fixtures and which are derived evidence?
  • Does every expected answer point back to a source row or document?

For related context, read DuckDB for Open Lakehouse Quality Checks, Why RAG Needs Open Data Infrastructure, and Retrieval Governance in Open Data Infrastructure.

The cheapest eval loop is the one your data can carry with it.