Open Data Infrastructure
DataFusion as an Embedded Query Engine for Agents
Apache DataFusion matters for agents because governed query execution will not always live in a remote warehouse.
Agents need more than a SQL endpoint. They need a way to reason over data without dragging every question through one distant control point.
The practical problem
Agentic systems often need small, bounded, policy-aware computations close to the workflow. A support agent may inspect a local customer context bundle. A data quality agent may compare a sample against schema metadata. A developer agent may query a cached slice of lineage or table statistics.
Apache DataFusion is interesting because it is an embeddable query engine, written in Rust, that uses Apache Arrow as its in-memory format. In Open Data Infrastructure (ODI) terms, that means execution can move into applications, services, and agent tools while data still uses open formats and explicit contracts.
Core idea: embedded query execution is powerful only when the agent receives governed data plus the metadata needed to interpret it.
The ODI boundary
DataFusion is compute. It can be embedded inside a service or agent runtime. That makes it different from a shared warehouse, but it does not remove the need for catalog, policy, lineage, and audit.
The boundary to protect is context. An agent should not receive a random directory of files and improvise. It should receive a scoped dataset, schema, source history, freshness, policy status, and execution limits. The query engine then answers a bounded question inside that envelope.
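One way to make that envelope concrete is to model it as a typed record that the tool checks before any query runs. The sketch below is illustrative only: `ContextEnvelope`, `ExecutionLimits`, the field names, and the policy status values are assumptions made for this example, not part of DataFusion or any catalog API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExecutionLimits:
    max_rows: int
    timeout_seconds: float


@dataclass(frozen=True)
class ContextEnvelope:
    """Hypothetical envelope handed to an agent together with the data."""
    dataset: str                      # name of the scoped dataset
    schema: dict[str, str]            # column name -> type name
    source_history: tuple[str, ...]   # upstream tables or jobs, oldest first
    refreshed_at: datetime            # when this slice was produced
    max_staleness: timedelta          # how old the slice may be
    policy_status: str                # assumed values: "approved", "pending", "revoked"
    limits: ExecutionLimits

    def problems(self, now: datetime) -> list[str]:
        """Return reasons not to query this envelope; an empty list means usable."""
        found = []
        if self.policy_status != "approved":
            found.append(f"policy status is {self.policy_status!r}")
        if now - self.refreshed_at > self.max_staleness:
            found.append("data is older than the allowed staleness")
        if not self.schema:
            found.append("schema is missing")
        if not self.source_history:
            found.append("source history is missing")
        return found


# Example: a six-hour-old, approved extract with a 24-hour staleness budget.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
envelope = ContextEnvelope(
    dataset="support_tickets_extract",
    schema={"ticket_id": "int64", "status": "string"},
    source_history=("crm.tickets", "export_job_nightly"),
    refreshed_at=now - timedelta(hours=6),
    max_staleness=timedelta(hours=24),
    policy_status="approved",
    limits=ExecutionLimits(max_rows=1000, timeout_seconds=5.0),
)
print(envelope.problems(now))
```

The point of returning a list of problems rather than a boolean is that the agent, or the human reviewing its output, can see why a slice was rejected.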
Patterns that work
Use DataFusion where the workflow benefits from local execution: validation, summarization over a governed extract, agent-side planning, or product features that need embedded analytical behavior. Keep large-scale preparation in the lakehouse and send the agent a constrained slice.
Expose data through a typed tool contract. If an MCP server or similar agent interface hands the model a query tool, that tool should enforce scope before execution. The agent should not decide which files or tables it is allowed to read.
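A minimal sketch of such a contract follows. The agent controls only the fields of `QueryRequest`; the set of readable tables and the row ceiling come from a server-side `Grant`. All names here (`Grant`, `QueryRequest`, `run_scoped_query`, the `Engine` callable) are hypothetical, and the engine is passed in as a plain function so the sketch does not assume any particular DataFusion binding.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Grant:
    """Server-side record of what one agent session may read (hypothetical)."""
    tables: Mapping[str, str]   # table name -> governed file location
    max_rows: int               # hard ceiling on rows returned


@dataclass(frozen=True)
class QueryRequest:
    """The only fields the agent controls."""
    grant_id: str
    sql: str
    max_rows: int = 100


# Hypothetical engine hook: (tables to register, sql, row limit) -> rows.
Engine = Callable[[Mapping[str, str], str, int], list]


def run_scoped_query(
    request: QueryRequest,
    grants: Mapping[str, Grant],
    engine: Engine,
) -> list:
    """Resolve scope from the grant, then hand a bounded query to the engine."""
    grant = grants.get(request.grant_id)
    if grant is None:
        raise PermissionError(f"unknown grant {request.grant_id!r}")
    if request.max_rows < 1:
        raise ValueError("max_rows must be positive")
    # The tool, not the agent, picks the tables and caps the row limit.
    row_limit = min(request.max_rows, grant.max_rows)
    return engine(grant.tables, request.sql, row_limit)
```

With an embedded engine, the hook would typically register only the granted tables in a fresh session, so SQL that names anything else fails to resolve instead of reaching unscoped files.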
Preserve source context. If the agent queries Parquet, ship the source table, snapshot, schema version, export job, and policy metadata alongside the files. Without that, the answer may be fast, cheap, and impossible to trust.
Failure modes
The first failure is embedding a query engine and calling it governance. That only moves execution closer to the agent. It does not answer questions of access, lineage, or meaning.
The second failure is unbounded tools. An agent with a generic SQL surface and broad file access can accidentally turn local execution into a data exfiltration path.
The third failure is forgetting reproducibility. If the agent output matters, the team needs to know which data slice, query, metadata version, and policy state produced it.
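A small run record is often enough to make that question answerable. The sketch below captures the four items named above and derives a stable fingerprint from them. `RunRecord` and its field names are assumptions for illustration; a real system would likely store more, such as the engine version and the caller identity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical audit record for one agent query."""
    data_slice: str         # identifier of the exact input slice, e.g. a snapshot id
    query: str              # the query text as executed
    metadata_version: str   # version of the schema and lineage metadata used
    policy_state: str       # identifier of the policy decision in force

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form; equal inputs give equal digests."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = RunRecord(
    data_slice="tickets@snapshot-0042",
    query="SELECT status, count(*) FROM tickets GROUP BY status",
    metadata_version="schema-v7",
    policy_state="policy-2024-01-02",
)
print(record.fingerprint())
```

Storing the fingerprint next to the agent's output gives a cheap check during replay: if any of the four inputs changed, the digest changes too.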
Questions to ask
- Which data can the embedded engine read, and who decided that boundary?
- Does the agent see schema and lineage with the data?
- Can execution be replayed with the same input slice and policy state?
- Which queries are allowed, limited, or denied before execution?
- When should the workload move from embedded execution back to a shared engine?
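The allowed, limited, or denied question can be turned into an explicit admission step that runs before the engine sees the query. The sketch below is deliberately naive: it inspects the SQL text with string checks and a regular expression, which is an assumption made to keep the example short. A production tool should make this decision from the engine's parsed plan, not from text. `Decision` and `admit` are hypothetical names.

```python
import re
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"   # run as written
    LIMIT = "limit"   # run, but only under a row cap the tool imposes
    DENY = "deny"     # do not run


# Matches a trailing "LIMIT <n>" with an optional final semicolon.
_TRAILING_LIMIT = re.compile(r"\blimit\s+(\d+)\s*;?\s*$", re.IGNORECASE)


def admit(sql: str, max_rows: int) -> Decision:
    """Classify a query before execution (text-based and conservative)."""
    statement = sql.strip()
    # Deny anything that is not a single statement starting with SELECT.
    # This also rejects CTEs and semicolons inside string literals.
    if not statement.lower().startswith("select"):
        return Decision.DENY
    if ";" in statement.rstrip(";"):
        return Decision.DENY
    match = _TRAILING_LIMIT.search(statement)
    if match and int(match.group(1)) <= max_rows:
        return Decision.ALLOW
    # No limit, or a limit above the ceiling: the tool must cap the result.
    return Decision.LIMIT


print(admit("SELECT * FROM tickets LIMIT 10", max_rows=100))
print(admit("SELECT * FROM tickets", max_rows=100))
print(admit("DROP TABLE tickets", max_rows=100))
```

Erring toward denial is the design choice here: a rejected query costs the agent one retry, while an admitted write or multi-statement query can cost much more.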
For related AI architecture, read Open Data Infrastructure for AI Agents, MCP Servers Need a Governed Data Layer, and Metadata as Prompt Context.
Sources to start with
Start with the primary docs. They are the contracts you can test against, not commentary about the contracts.