AI incidents do not wait while teams argue over which system owns the evidence.

AI incident response starts in the data layer

When an AI system gives a bad answer, calls the wrong tool, retrieves restricted context, or proposes a harmful data change, the investigation quickly moves beyond the model. Responders need table state, catalog metadata, retrieval paths, tool logs, policy decisions, lineage, and owners.

Open data infrastructure (ODI) matters because incident evidence cannot live inside one vendor console. Incident responders need portable facts that can be joined across tools.

Evidence has to cross systems

Useful evidence includes Iceberg snapshot IDs, catalog objects, OpenLineage runs, OPA decision logs, tool registry entries, evaluation runs, and human review records. Each piece explains a different part of the incident.

Core idea: an AI incident runbook should start with evidence paths, not screenshots.
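A minimal sketch of what an "evidence path" could look like, assuming each system can be referenced by a stable identifier it already issues. The `EvidenceRef` type, the system and kind labels, and the example IDs are hypothetical illustrations, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvidenceRef:
    """A portable pointer to one piece of incident evidence (hypothetical schema)."""
    system: str   # owning system, e.g. "iceberg", "openlineage", "opa"
    kind: str     # what the ID refers to, e.g. "snapshot", "run", "decision"
    ref_id: str   # identifier exactly as issued by the owning system


def evidence_path_json(refs: list[EvidenceRef]) -> str:
    """Serialize an ordered evidence path as plain JSON any tool can parse and join."""
    return json.dumps([asdict(r) for r in refs], sort_keys=True)


# Hypothetical identifiers for one incident; real values come from each system.
path = [
    EvidenceRef("iceberg", "snapshot", "3051729675574597004"),
    EvidenceRef("openlineage", "run", "0b7c6e2a-5f1d-4c8e-9a3b-2d4f6e8a1c3b"),
    EvidenceRef("opa", "decision", "4ca636c1-55e4-417a-b1d8-4aceb67960d1"),
]
print(evidence_path_json(path))
```

Plain JSON is the design choice here: the path can be attached to a ticket, stored in a table, or joined by another tool without opening any vendor console.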

Build the ODI runbook

The runbook should answer five questions quickly: what did the agent see, which tool did it call, which data state was involved, which policy allowed the path, and who owns remediation. If any field is missing, the investigation turns into archaeology.
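The five questions can double as an automated completeness check. A minimal sketch, assuming a review packet is a plain dictionary; the field names (`context`, `tool`, `data_state`, `policy`, `owner`) and the sample values are hypothetical.

```python
# Each runbook question maps to the packet field that answers it (hypothetical names).
RUNBOOK_QUESTIONS = {
    "What did the agent see?": "context",
    "Which tool did it call?": "tool",
    "Which data state was involved?": "data_state",
    "Which policy allowed the path?": "policy",
    "Who owns remediation?": "owner",
}


def unanswered(packet: dict) -> list[str]:
    """Return the runbook questions whose field is missing or empty in the packet."""
    return [q for q, field in RUNBOOK_QUESTIONS.items() if not packet.get(field)]


packet = {
    "context": ["retrieval:docs/pricing.md"],
    "tool": "crm.update_record",
    "data_state": {"table": "sales.accounts", "snapshot_id": 3051729675574597004},
    "policy": None,  # decision log was not captured
    "owner": "",     # ownership lives outside the incident system
}
for question in unanswered(packet):
    print("MISSING:", question)
```

Running the check at ingestion time, rather than during an incident, is what keeps the investigation from turning into archaeology.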

For related guidance, see the guides on the foundation for AI metadata incident response, context graphs for AI root cause analysis, and agentic data replay logs.

What breaks first

  • The model trace exists, but the retrieved data source version is missing.
  • Policy logs show allow or deny decisions without tool context.
  • Lineage identifies a dataset but not the snapshot or table state used.
  • No owner can approve remediation because ownership lives outside the incident system.
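Each failure mode above can be caught before an incident with a simple record check. A minimal sketch over simplified, hypothetical record shapes: real OpenLineage events, OPA decision logs, and model traces carry more fields and different nesting, so the keys used here (`retrieved_sources`, `input.tool`, `snapshot_id`) are assumptions.

```python
def find_gaps(trace: dict, policy_log: dict, lineage: dict, ownership: dict) -> list[str]:
    """Flag the 'what breaks first' gaps in hypothetical, simplified evidence records."""
    gaps = []
    # The model trace exists, but a retrieved source has no version.
    for source in trace.get("retrieved_sources", []):
        if not source.get("version"):
            gaps.append(f"trace: no version for retrieved source {source.get('name')}")
    # A policy decision was recorded without the tool it was about.
    if "result" in policy_log and not policy_log.get("input", {}).get("tool"):
        gaps.append("policy: decision has no tool context")
    for ds in lineage.get("inputs", []):
        name = ds.get("name")
        # Lineage names a dataset but not the table state (snapshot) used.
        if not ds.get("snapshot_id"):
            gaps.append(f"lineage: no snapshot for dataset {name}")
        # No owner is resolvable from the incident system for that dataset.
        if not ownership.get(name):
            gaps.append(f"ownership: no owner for dataset {name}")
    return gaps


trace = {"retrieved_sources": [{"name": "kb.pricing", "version": None}]}
policy_log = {"decision_id": "d-1", "result": True, "input": {"user": "agent-7"}}
lineage = {"inputs": [{"name": "sales.accounts", "snapshot_id": None}]}
for gap in find_gaps(trace, policy_log, lineage, ownership={}):
    print(gap)
```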

Incident questions

Ask whether responders can reconstruct the context path, policy path, tool path, data path, and owner path in one review packet. If not, the AI platform is not incident-ready.
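Assembling the review packet is mostly a join on a shared key. A minimal sketch, assuming every log carries the same `trace_id`; that key, the per-path log names, and the record fields are hypothetical, and in practice propagating one key across systems is the hard part.

```python
from collections import defaultdict

PATHS = ("context", "policy", "tool", "data", "owner")


def build_packets(logs: dict[str, list[dict]]) -> dict[str, dict[str, list[dict]]]:
    """Group records from each path's log by a shared (hypothetical) trace_id."""
    packets: dict[str, dict[str, list[dict]]] = defaultdict(lambda: {p: [] for p in PATHS})
    for path, records in logs.items():
        for record in records:
            packets[record["trace_id"]][path].append(record)
    return dict(packets)


def incident_ready(packet: dict[str, list[dict]]) -> bool:
    """A packet is ready only if every path has at least one record."""
    return all(packet[p] for p in PATHS)


logs = {
    "context": [{"trace_id": "t-42", "source": "kb.pricing", "version": "v17"}],
    "policy": [{"trace_id": "t-42", "decision_id": "d-9", "result": "allow"}],
    "tool": [{"trace_id": "t-42", "tool": "crm.update_record"}],
    "data": [{"trace_id": "t-42", "table": "sales.accounts", "snapshot_id": 3051729675574597004}],
    "owner": [],  # ownership never made it into the incident system
}
packets = build_packets(logs)
print(incident_ready(packets["t-42"]))  # False: the owner path is empty
```

An empty path is treated as a failure rather than skipped, which matches the test in the text: if any of the five paths cannot be reconstructed, the platform is not incident-ready.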


Incident response works when the evidence is already infrastructure.