Open Data Infrastructure
DataFusion Policy-Aware Query Services
How embedded query services can combine DataFusion plans, catalog metadata, and policy checks.
A data API that hides every query behind a friendly endpoint still leaves the hardest part of the architecture unresolved: someone has to decide what each query is allowed to see.
The practical problem
Apache DataFusion gives teams a query engine they can embed in services. Its DataFrame API builds on logical plans, and its SQL support lets services run familiar queries over registered data. That is powerful for data product APIs, especially when teams want to offer query capability without handing every consumer a warehouse login.
The governance risk is equally direct. Once a service embeds query execution, the service also owns the policy boundary unless the architecture connects plans, catalogs, identities, and authorization decisions explicitly.
Plans need policy context
A policy-aware service should inspect the logical plan before execution. Which tables does the query touch? Which columns appear? Which predicates are present? Which consumer is asking? Which data product contract applies? That context can feed Open Policy Agent or another policy decision point before the plan becomes work.
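As a minimal sketch of that hand-off, the Python below assembles a policy input document from a plan summary. The `PlanSummary` type, its field names, and the consumer and contract identifiers are all hypothetical; a real service would populate them by walking the DataFusion logical plan. The payload is wrapped in an `input` key, which is the shape OPA's REST Data API accepts for query input, but nothing is sent over the network here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanSummary:
    """Hypothetical summary a service extracts from a logical plan."""
    tables: tuple[str, ...]
    columns: tuple[str, ...]
    predicates: tuple[str, ...]


def build_policy_input(plan: PlanSummary, consumer: str, contract: str) -> dict:
    """Assemble the document a policy decision point would evaluate."""
    return {
        "input": {
            "consumer": consumer,
            "contract": contract,
            # Sort so the same plan always yields the same policy input.
            "tables": sorted(plan.tables),
            "columns": sorted(plan.columns),
            "predicates": list(plan.predicates),
        }
    }


# Illustrative values only; none of these names come from a real catalog.
plan = PlanSummary(
    tables=("sales.orders",),
    columns=("orders.region", "orders.amount"),
    predicates=("orders.region = 'EU'",),
)
payload = build_policy_input(plan, consumer="svc-reporting", contract="orders-v2")
print(json.dumps(payload, indent=2))
```

Keeping the input deterministic makes policy decisions easier to test and to replay later from audit records.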
The service should also record the answer. A successful request should leave enough evidence to reconstruct the plan, policy decision, data product version, and lineage event. A denied request should explain the denial in a way a developer can act on.
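One way to picture that evidence is a small audit record written for every decision, allowed or denied. The sketch below is an assumption about what such a record might hold: `AuditRecord`, `record_decision`, and `denial_message` are hypothetical names, and the plan text is an illustrative string, not output captured from DataFusion.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical evidence kept for each policy decision."""
    consumer: str
    plan_fingerprint: str
    data_product_version: str
    allowed: bool
    reasons: tuple[str, ...]
    recorded_at: str


def fingerprint(plan_text: str) -> str:
    """Hash the plan text so identical plans can be matched later."""
    return hashlib.sha256(plan_text.encode("utf-8")).hexdigest()


def record_decision(consumer, plan_text, version, allowed, reasons=()):
    """Build the audit record for one request."""
    return AuditRecord(
        consumer=consumer,
        plan_fingerprint=fingerprint(plan_text),
        data_product_version=version,
        allowed=allowed,
        reasons=tuple(reasons),
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def denial_message(record: AuditRecord) -> str:
    """Turn a denied record into an error a developer can act on."""
    if record.allowed:
        raise ValueError("record is not a denial")
    return (
        f"Query denied for {record.consumer} "
        f"(data product {record.data_product_version}): "
        + "; ".join(record.reasons)
    )


# Illustrative plan text and reason; both are made up for this example.
plan_text = "Projection: orders.email\n  TableScan: orders"
denied = record_decision(
    "svc-reporting",
    plan_text,
    "orders-v2",
    allowed=False,
    reasons=["column orders.email is not in the contract"],
)
print(json.dumps(asdict(denied), indent=2))
print(denial_message(denied))
```

The denial message names the consumer, the data product version, and the specific column, which gives a developer something concrete to fix or request.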
Core idea: Embedded execution does not remove governance. It moves governance closer to the code path.
The ODI pattern
Open Data Infrastructure works best when execution engines are not forced to become the whole platform. DataFusion can execute. The catalog can describe ownership and table contracts. A policy engine can evaluate access. Lineage can record what happened. The service should connect those parts instead of pretending one component can do all of them.
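The separation of roles described above can be sketched as four narrow interfaces wired together by the service. Everything here is hypothetical: the interface names, the in-memory stand-ins, and the choice to pass the touched tables in explicitly (a real service would derive them from the plan). The stand-in engine returns canned rows and does not call DataFusion.

```python
from dataclasses import dataclass
from typing import Protocol


class Catalog(Protocol):
    def contract_for(self, table: str) -> str: ...


class PolicyEngine(Protocol):
    def allow(self, consumer: str, tables: list[str], contracts: dict[str, str]) -> bool: ...


class Engine(Protocol):
    def execute(self, sql: str) -> list[dict]: ...


class LineageSink(Protocol):
    def emit(self, event: dict) -> None: ...


@dataclass
class QueryService:
    """Connects the parts; it owns none of their responsibilities."""
    catalog: Catalog
    policy: PolicyEngine
    engine: Engine
    lineage: LineageSink

    def run(self, consumer: str, sql: str, tables: list[str]) -> list[dict]:
        contracts = {t: self.catalog.contract_for(t) for t in tables}
        allowed = self.policy.allow(consumer, tables, contracts)
        # Record the decision whether or not the query goes on to run.
        self.lineage.emit({"consumer": consumer, "tables": tables,
                           "contracts": contracts, "allowed": allowed})
        if not allowed:
            raise PermissionError(f"{consumer} may not read {tables}")
        return self.engine.execute(sql)


# In-memory stand-ins, for illustration only.
class DictCatalog:
    def __init__(self, contracts): self._contracts = contracts
    def contract_for(self, table): return self._contracts[table]


class AllowListPolicy:
    def __init__(self, grants): self._grants = grants
    def allow(self, consumer, tables, contracts):
        return set(tables) <= self._grants.get(consumer, set())


class StaticEngine:
    def __init__(self, rows): self._rows = rows
    def execute(self, sql): return list(self._rows)


class ListLineage:
    def __init__(self): self.events = []
    def emit(self, event): self.events.append(event)


lineage = ListLineage()
service = QueryService(
    catalog=DictCatalog({"orders": "orders-v2"}),
    policy=AllowListPolicy({"svc-reporting": {"orders"}}),
    engine=StaticEngine([{"region": "EU"}]),
    lineage=lineage,
)
rows = service.run("svc-reporting", "SELECT region FROM orders", ["orders"])
print(rows, lineage.events)
```

Because each part sits behind its own interface, any one of them can be swapped for a real implementation without changing the others, and the whole path can be tested with the stand-ins.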
This is especially useful for governed APIs. The API can expose a narrow interface while the underlying query path remains inspectable, portable, and testable.
What breaks first
- The service checks authentication but never checks column-level or row-level policy.
- The query plan is optimized and executed without leaving audit evidence.
- Policy decisions depend on route names instead of the actual tables and fields used.
- Consumers get successful answers but no lineage back to the data product contract.
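The first and third failure modes come down to what the policy check is keyed on. As a rough sketch, the check below is keyed on consumer, table, and column instead of route names, and it returns a row filter the service would need to apply. The grant table, the `Grant` and `Decision` types, and the filter expression are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical grant: readable columns plus a mandatory row filter."""
    columns: frozenset[str]
    row_filter: Optional[str] = None


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reasons: tuple[str, ...]
    row_filter: Optional[str]


# Keyed on (consumer, table), not on the API route that was called.
GRANTS = {
    ("svc-reporting", "orders"): Grant(
        columns=frozenset({"region", "amount"}),
        row_filter="region = 'EU'",
    ),
}


def check_access(consumer: str, table: str, columns: list[str]) -> Decision:
    """Column-level check that also returns the row-level filter to apply."""
    grant = GRANTS.get((consumer, table))
    if grant is None:
        return Decision(False, (f"no grant on table {table}",), None)
    denied = sorted(set(columns) - grant.columns)
    if denied:
        reasons = tuple(f"column {table}.{c} not granted" for c in denied)
        return Decision(False, reasons, None)
    return Decision(True, (), grant.row_filter)


ok = check_access("svc-reporting", "orders", ["region", "amount"])
blocked = check_access("svc-reporting", "orders", ["region", "email"])
print(ok)
print(blocked)
```

An allowed decision still carries a row filter, so a successful column check alone is not treated as permission to read every row.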
Questions to ask
Ask whether the service can explain a query before it runs. Ask whether policy sees the same table and column context the engine sees. Ask whether denied requests create useful evidence instead of generic errors.
For related architecture, read DataFusion query plans for governed APIs, policy enforcement in open data systems, and data lineage for AI-ready infrastructure.
Sources to start with
These primary sources anchor the technical claims in this guide.
- Apache DataFusion DataFrame API
- Apache DataFusion SQL reference
- Open Policy Agent documentation
- OpenLineage documentation
A governed query service should make the execution path visible before it makes the answer available.