DataFusion as an Embedded Query Engine for Agents

Agents need more than a SQL endpoint. They need a way to reason over data without dragging every question through one distant control point.

The practical problem

Agentic systems often need small, bounded, policy-aware computations close to the workflow. A support agent may inspect a local customer context bundle. A data quality agent may compare a sample against schema metadata. A developer agent may query a cached slice of lineage or table statistics.

Apache DataFusion is interesting because it is an embeddable query engine, written in Rust and built on Apache Arrow as its in-memory columnar format. In Open Data Infrastructure (ODI) terms, that means execution can move into applications, services, and agent tools while data still uses open formats and explicit contracts.

Core idea: embedded query execution is powerful only when the agent receives governed data plus the metadata needed to interpret it.

The ODI boundary

DataFusion is compute. It can be embedded inside a service or agent runtime. That makes it different from a shared warehouse, but it does not remove the need for catalog, policy, lineage, and audit.

The boundary to protect is context. An agent should not receive a random directory of files and improvise. It should receive a scoped dataset, schema, source history, freshness, policy status, and execution limits. The query engine then answers a bounded question inside that envelope.
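One way to make that envelope concrete is a small typed record that the service checks before any query runs. The sketch below is illustrative only: the class names, field names, and the "approved" status value are assumptions, not part of DataFusion or any ODI specification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExecutionLimits:
    max_rows: int
    timeout_seconds: float


@dataclass(frozen=True)
class ContextEnvelope:
    """Hypothetical shape of what the agent receives alongside the data."""
    dataset: str            # scoped dataset name
    schema: dict            # column name -> type name
    source_history: tuple   # upstream tables or jobs that produced the slice
    as_of: datetime         # when the slice was produced (freshness)
    policy_status: str      # assumed vocabulary: "approved" means usable
    limits: ExecutionLimits

    def problems(self, now: datetime, max_age: timedelta) -> list:
        """Return the reasons this envelope should not be queried."""
        issues = []
        if not self.schema:
            issues.append("schema is missing")
        if not self.source_history:
            issues.append("source history is missing")
        if now - self.as_of > max_age:
            issues.append("data is older than the freshness limit")
        if self.policy_status != "approved":
            issues.append("policy status is not approved")
        return issues


envelope = ContextEnvelope(
    dataset="support_context_sample",
    schema={"customer_id": "int64", "plan": "utf8"},
    source_history=("crm.customers", "export_job_17"),
    as_of=datetime(2024, 1, 1, tzinfo=timezone.utc),
    policy_status="approved",
    limits=ExecutionLimits(max_rows=1000, timeout_seconds=5.0),
)
now = datetime(2024, 1, 1, 6, tzinfo=timezone.utc)
print(envelope.problems(now, max_age=timedelta(hours=24)))  # prints []
```

An empty list means the engine may proceed. Anything else is a reason to refuse the query rather than let the agent improvise.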

Patterns that work

Use DataFusion where the workflow benefits from local execution: validation, summarization over a governed extract, agent-side planning, or product features that need embedded analytical behavior. Keep large-scale preparation in the lakehouse and send the agent a constrained slice.

Expose data through a typed tool contract. If an MCP server or similar agent interface hands the model a query tool, that tool should enforce scope before execution. The agent should not decide which files or tables it is allowed to read.
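A minimal sketch of such a contract follows. It assumes the service holds a grant that maps logical dataset names to locations, and that the engine is injected as a plain callable. Every name here (QueryRequest, ScopedQueryTool, ScopeError, the example path) is hypothetical. In a real deployment the callable might register the granted location with an embedded engine session and run the SQL there.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


class ScopeError(PermissionError):
    """Raised when a request names a dataset outside the grant."""


@dataclass(frozen=True)
class QueryRequest:
    dataset: str  # logical name chosen from the grant, never a file path
    sql: str


class ScopedQueryTool:
    def __init__(self, grants: Mapping[str, str],
                 engine: Callable[[str, str], list]):
        # grants: dataset name -> location, decided by the service, not the agent
        self._grants = dict(grants)
        self._engine = engine

    def run(self, request: QueryRequest) -> list:
        location = self._grants.get(request.dataset)
        if location is None:
            # Scope is checked before the engine sees anything.
            raise ScopeError(f"dataset {request.dataset!r} is not in scope")
        return self._engine(location, request.sql)
```

The agent only ever supplies a dataset name and a query. Resolving that name to files stays on the service side, so a request for anything outside the grant fails before execution.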

Preserve source context. If the agent queries Parquet, include table, snapshot, schema version, export job, and policy metadata. Without that, the answer may be fast, cheap, and impossible to trust.
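As a sketch of what preserving context could look like, the helper below refuses to return rows unless the provenance fields named above are present. The function name, the key names, and the example values are assumptions made for illustration.

```python
# Provenance keys mirror the list in the text; the exact names are hypothetical.
REQUIRED_PROVENANCE = ("table", "snapshot", "schema_version", "export_job", "policy")


def attach_provenance(rows, provenance: dict) -> dict:
    """Bundle query rows with their source context, or fail if any is missing."""
    missing = [key for key in REQUIRED_PROVENANCE if not provenance.get(key)]
    if missing:
        raise ValueError("missing provenance: " + ", ".join(missing))
    return {
        "rows": list(rows),
        "provenance": {key: provenance[key] for key in REQUIRED_PROVENANCE},
    }


answer = attach_provenance(
    [("gold", 12)],
    {
        "table": "crm.customers",
        "snapshot": "snapshot-42",
        "schema_version": "v3",
        "export_job": "export_job_17",
        "policy": "support-read",
    },
)
```

Failing closed is a deliberate choice here: a result without provenance is treated as an error, not as a slightly less useful answer.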

Failure modes

The first failure is embedding a query engine and calling it governance. Embedding only moves execution closer to the agent. It does not settle who may access the data, where the data came from, or what it means.

The second failure is unbounded tools. An agent with a generic SQL surface and broad file access can accidentally turn local execution into a data exfiltration path.
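One simple bound is a hard cap on how many rows a tool call may return. The sketch below is an assumption about how a service might wrap an engine's result iterator; the names capped and LimitExceeded are hypothetical. It limits output size only, so scan cost, memory, and timeouts still need their own limits in the engine or the service.

```python
from itertools import islice
from typing import Iterable

_END = object()  # sentinel that cannot collide with a real row


class LimitExceeded(RuntimeError):
    """Raised when a result is larger than the tool allows."""


def capped(rows: Iterable[tuple], max_rows: int) -> list:
    """Return all rows if there are at most max_rows, otherwise fail."""
    iterator = iter(rows)
    kept = list(islice(iterator, max_rows))
    # Pull one more item to detect an oversized result without reading it all.
    if next(iterator, _END) is not _END:
        raise LimitExceeded(f"result exceeds {max_rows} rows")
    return kept
```

Raising instead of silently truncating keeps the agent from treating a partial result as complete, and it gives the audit trail a clear denied event.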

The third failure is forgetting reproducibility. If the agent output matters, the team needs to know which data slice, query, metadata version, and policy state produced it.
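A lightweight way to capture those four inputs is to store them as one record and derive a stable fingerprint from it. This is a sketch under assumptions: the RunRecord name, its fields, and the example identifiers are illustrative, and a real system would log the record itself as well as the hash.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    data_slice: str        # identifier of the exact input slice
    query: str             # the query text that was executed
    metadata_version: str  # schema or catalog version in effect
    policy_state: str      # policy version or decision in effect

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, stable across runs and machines."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = RunRecord(
    data_slice="orders_sample@snapshot-42",
    query="SELECT count(*) FROM orders_sample",
    metadata_version="schema-v3",
    policy_state="policy-2024-01",
)
print(record.fingerprint())
```

Two runs with the same fingerprint used the same slice, query, metadata version, and policy state. A change in any one of them produces a different value, which makes drift visible when an agent's output is questioned later.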

Questions to ask

Which data can the embedded engine read, and who decided that boundary?
Does the agent see schema and lineage with the data?
Can execution be replayed with the same input slice and policy state?
Which queries are allowed, limited, or denied before execution?
When should the workload move from embedded execution back to a shared engine?

Sources to start with

Start with the primary docs. They are the contracts you can test against, not commentary about the contracts.

ODI for agents
Agent reference architecture
Article library
