Foundation Models Need Data Contracts

Foundation models are good at absorbing ambiguity. That is useful for language. It is dangerous for enterprise data.

The practical problem

Teams sometimes treat better models as a substitute for better data contracts. The model can parse messy fields, infer intent, translate schema names, and produce an answer that sounds coherent. Great. Also terrifying.

Foundation model systems still need data contracts. They need to know which datasets are approved, which fields mean what, which policies apply, which sources are authoritative, which values are uncertain, and which actions are allowed.

Core idea: the more capable the model, the more explicit the data contract has to be.

The ODI boundary

The model boundary and the data boundary are different. A model can reason over context. It should not be responsible for inventing the contract that decides whether the context is valid.

Data contracts define expected shape, meaning, ownership, lineage, freshness, policy, and use constraints. In an Open Data Infrastructure (ODI) stack, those contracts should be available to retrieval systems, agent tools, evaluation pipelines, and human reviewers.
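To make that list concrete, here is a minimal sketch of a contract as a plain Python object. The class names, field names, dataset names, and owner address are all hypothetical, chosen only to mirror the properties listed above. This is not a standard contract format.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: str    # expected shape of the column
    meaning: str  # business definition in plain language


@dataclass(frozen=True)
class DataContract:
    dataset: str
    owner: str                    # who answers for this data
    fields: tuple[FieldSpec, ...]
    lineage: tuple[str, ...]      # upstream datasets
    max_staleness: timedelta      # freshness expectation
    policy_tags: frozenset[str]   # e.g. "internal", "pii"
    allowed_uses: frozenset[str]  # use constraints

    def permits(self, use: str) -> bool:
        """Return True only if this use is explicitly listed."""
        return use in self.allowed_uses


# Hypothetical example contract for a revenue table.
revenue_contract = DataContract(
    dataset="finance.monthly_revenue",
    owner="finance-data@example.com",
    fields=(
        FieldSpec("month", "date", "First day of the calendar month"),
        FieldSpec("net_revenue", "decimal", "Recognized revenue net of refunds"),
    ),
    lineage=("billing.invoices", "billing.refunds"),
    max_staleness=timedelta(hours=24),
    policy_tags=frozenset({"internal"}),
    allowed_uses=frozenset({"retrieval", "evaluation"}),
)
```

Because uses are an explicit allow-list, `revenue_contract.permits("automated_action")` is false until someone adds that use on purpose.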

Patterns that work

Use contracts at the retrieval boundary. Before the model sees context, the system should know which source is approved, which fields are allowed, and which quality signals apply.
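A rough sketch of what enforcement at the retrieval boundary might look like, assuming a hypothetical in-memory registry keyed by source name. `RetrievalContract`, `build_context`, and the example sources are illustrative names, not part of any particular retrieval framework.

```python
from dataclasses import dataclass


class ContractViolation(Exception):
    """Raised when a source may not be used as model context."""


@dataclass
class RetrievalContract:
    source: str
    approved: bool
    allowed_fields: frozenset
    quality_signals: dict  # e.g. completeness, last validation result


def build_context(row: dict, source: str,
                  contracts: dict[str, RetrievalContract]) -> dict:
    """Check the contract before anything reaches the model."""
    contract = contracts.get(source)
    if contract is None or not contract.approved:
        raise ContractViolation(f"source {source!r} is not approved for retrieval")
    # Drop every field the contract does not explicitly allow.
    fields = {k: v for k, v in row.items() if k in contract.allowed_fields}
    return {
        "source": source,
        "fields": fields,
        "quality": dict(contract.quality_signals),
    }


# Hypothetical registry with a single approved source.
CONTRACTS = {
    "crm.customers": RetrievalContract(
        source="crm.customers",
        approved=True,
        allowed_fields=frozenset({"customer_id", "status"}),
        quality_signals={"completeness": 0.98},
    )
}

context = build_context(
    {"customer_id": 7, "status": "active", "ssn": "000-00-0000"},
    "crm.customers",
    CONTRACTS,
)
```

The check runs in code, before the prompt is assembled, so an unapproved source raises an error instead of quietly becoming context.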

Use contracts in evaluation. If an AI system answers questions about revenue, eligibility, risk, or customer status, the evaluation set should reflect the canonical definitions and known edge cases. Do not evaluate against vibes in a spreadsheet.

Use contracts for action. A model can draft a recommendation from uncertain data. It should not trigger a regulated decision unless the contracts behind the supporting data meet the risk threshold for that decision.
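One way to picture the action gate, as a sketch under stated assumptions: each action declares a minimum assurance level, and the weakest piece of supporting data decides the outcome. The `Assurance` levels, action names, and `may_execute` function are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Assurance(IntEnum):
    UNVERIFIED = 0
    REVIEWED = 1
    CERTIFIED = 2


@dataclass(frozen=True)
class Evidence:
    dataset: str
    assurance: Assurance  # assumed to come from the dataset's contract


# Hypothetical minimum assurance required per action.
REQUIRED = {
    "draft_recommendation": Assurance.UNVERIFIED,
    "regulated_decision": Assurance.CERTIFIED,
}


def may_execute(action: str, evidence: list[Evidence]) -> bool:
    """Allow an action only if the weakest supporting data meets its threshold."""
    required = REQUIRED.get(action)
    if required is None or not evidence:
        return False  # unknown action or no supporting data: fail closed
    return min(e.assurance for e in evidence) >= required


mixed = [
    Evidence("risk.scores", Assurance.CERTIFIED),
    Evidence("crm.notes", Assurance.REVIEWED),
]
```

With the `mixed` evidence, drafting is allowed but the regulated decision is not, because one input is only reviewed.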

Failure modes

The first failure is prompt-as-contract. A system prompt says "use trusted data," but the tool layer does not enforce which data is trusted.

The second failure is schema guessing. The model chooses a likely column, joins on a plausible key, and returns an answer whose business meaning nobody can verify.
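The alternative to guessing is a lookup that can fail. A minimal sketch, assuming a hypothetical mapping from business terms to governed columns (`SEMANTIC_MAP`, the table names, and the column names are made up for illustration):

```python
class UnknownTerm(LookupError):
    """Raised when no contract defines the requested business term."""


# Hypothetical mapping published alongside the contracts:
# business term -> (dataset, column).
SEMANTIC_MAP = {
    "net_revenue": ("finance.monthly_revenue", "net_rev_usd"),
    "customer_id": ("crm.customers", "cust_id"),
}


def resolve(term: str) -> tuple[str, str]:
    """Return the governed (dataset, column) for a term, or refuse."""
    try:
        return SEMANTIC_MAP[term]
    except KeyError:
        raise UnknownTerm(
            f"no contract defines {term!r}; refusing to guess a column"
        ) from None
```

A tool built this way returns an error for "revenue" instead of picking whichever column looks closest. The error is the useful part, since it tells the team which definition is missing.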

The third failure is evaluation without lineage. The answer is scored as correct or incorrect, but nobody knows which source state produced it.
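A sketch of an evaluation record that carries its lineage, assuming the storage layer can supply some version or snapshot identifier for the data the answer was built from. `EvalRecord`, `score`, and the identifier values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRecord:
    case_id: str
    expected: str
    actual: str
    dataset: str
    snapshot_id: Optional[str]  # which state of the source produced the answer


def score(record: EvalRecord) -> dict:
    """Score a case, but only if it can be traced to a source state."""
    if not record.snapshot_id:
        raise ValueError(
            f"case {record.case_id!r} has no snapshot_id; result is not reproducible"
        )
    return {
        "case_id": record.case_id,
        "correct": record.expected == record.actual,
        "dataset": record.dataset,
        "snapshot_id": record.snapshot_id,
    }


result = score(
    EvalRecord(
        case_id="rev-2024-q1",
        expected="1200",
        actual="1200",
        dataset="finance.monthly_revenue",
        snapshot_id="snap-0001",  # placeholder identifier
    )
)
```

When a score changes between runs, the snapshot identifier helps separate a change in the model from a change in the data.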

Questions to ask

Which data contracts are available to retrieval and agent tools?
Can the model see source, freshness, lineage, and policy with the context?
Which contracts are required before an action is automated?
Can evaluation cases trace expected answers to governed data sources?
What happens when the contract is missing, stale, or violated?
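The last question is easier to answer when the three cases are distinct states in code. The sketch below is one possible policy, not a recommendation: it refuses on missing or violated contracts and answers with a warning on stale data. The contract is a plain dict with hypothetical keys `max_staleness` and `required_fields`.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class ContractStatus(Enum):
    OK = "ok"
    MISSING = "missing"
    STALE = "stale"
    VIOLATED = "violated"


def check_contract(contract: Optional[dict], row: dict,
                   last_updated: datetime, now: datetime) -> ContractStatus:
    """Classify the data against its contract; never return OK by default."""
    if contract is None:
        return ContractStatus.MISSING
    if now - last_updated > contract["max_staleness"]:
        return ContractStatus.STALE
    if not set(contract["required_fields"]).issubset(row):
        return ContractStatus.VIOLATED
    return ContractStatus.OK


def decide(status: ContractStatus) -> str:
    """Assumed policy: warn on stale data, refuse on missing or violated."""
    if status is ContractStatus.OK:
        return "answer"
    if status is ContractStatus.STALE:
        return "answer_with_warning"
    return "refuse"


# Fixed timestamps so the example is deterministic.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
CONTRACT = {
    "max_staleness": timedelta(hours=24),
    "required_fields": ["customer_id", "status"],
}
```

Whether stale data earns a warning or a refusal is a risk decision for each team. The sketch only shows the decision written down where it can be tested.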

For related AI-readiness work, read AI Data Contracts for Agents, Data Readiness for Fine-Tuning, and AI-Ready Data Quality Signals.

Sources to start with

Start with the primary docs. They are the contracts you can test against, not commentary about the contracts.

AI data contracts
AI quality signals
ODI for agents
