A data product API should not be a fancy tunnel into one warehouse. It should be a governed interface over data that the organization still controls.

The practical problem

Data product teams increasingly need APIs that answer domain-specific questions, not just raw SQL endpoints. They want controlled filters, predictable schemas, policy-aware responses, and enough performance for product use.

Apache DataFusion is interesting here because it is an embeddable query engine that uses Apache Arrow as its in-memory columnar format. It can sit inside a service that exposes governed data product APIs while leaving the storage and table contracts open.

Core idea: DataFusion is most valuable when the API owns the product contract and the open data layer owns the portability contract.

The architecture that works

The API should expose product operations, not arbitrary infrastructure. A customer risk API, inventory availability API, or revenue quality API can compile approved requests into DataFusion query plans over open files, tables, or registered data sources.
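A minimal sketch of that compilation step, assuming a hypothetical operation registry: each product operation names its table, its returned fields, and the filters a caller may set. The operation name, table, and column names are illustrative. The `$1`-style positional placeholders are also an assumption; how parameters are bound depends on the engine binding the service uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One approved product operation (hypothetical registry entry)."""
    name: str
    table: str
    fields: tuple   # columns the operation may return
    filters: tuple  # columns a caller may filter on


# Illustrative registry; a real service would load this from its product contract.
OPERATIONS = {
    "inventory_availability": Operation(
        name="inventory_availability",
        table="inventory",
        fields=("sku", "warehouse_id", "available_qty"),
        filters=("sku", "warehouse_id"),
    ),
}


def compile_request(op_name, filters):
    """Turn an approved request into SQL text plus positional parameters."""
    op = OPERATIONS.get(op_name)
    if op is None:
        raise ValueError(f"unknown operation: {op_name}")
    unknown = set(filters) - set(op.filters)
    if unknown:
        raise ValueError(f"filters not allowed: {sorted(unknown)}")

    # Column names come only from the registry; caller input is passed as values.
    names = sorted(filters)
    sql = f"SELECT {', '.join(op.fields)} FROM {op.table}"
    if names:
        where = " AND ".join(f"{n} = ${i}" for i, n in enumerate(names, start=1))
        sql += f" WHERE {where}"
    return sql, [filters[n] for n in names]
```

Calling `compile_request("inventory_availability", {"sku": "A-1"})` yields a fixed SELECT over the approved columns with one bound value. The caller never supplies SQL text, so the API surface stays limited to the operations the registry defines.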

Policy should wrap the request before query execution. The service should know the caller, the purpose, the allowed domain, and the approved fields. DataFusion can execute the query. It should not be asked to become the entire governance brain.
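One way to picture the policy wrapper is a small check that runs before any query is built. This is a sketch under assumptions: the `Caller` and `Decision` types, the policy table, and the domain, purpose, and field names are all hypothetical, and a real service would more likely delegate to a dedicated policy system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    principal: str
    purpose: str
    domain: str


@dataclass(frozen=True)
class Decision:
    allowed: bool
    fields: tuple
    reason: str


# Hypothetical policy table: (domain, purpose) -> approved fields.
POLICY = {
    ("risk", "underwriting"): ("customer_id", "risk_score", "risk_band"),
    ("risk", "marketing"): ("risk_band",),
}


def authorize(caller, requested_fields):
    """Decide whether the caller may read the requested fields."""
    approved = POLICY.get((caller.domain, caller.purpose))
    if approved is None:
        return Decision(False, (), "no policy for domain/purpose")
    denied = [f for f in requested_fields if f not in approved]
    if denied:
        return Decision(False, (), f"fields not approved: {denied}")
    return Decision(True, tuple(requested_fields), "ok")
```

The decision object carries a reason so that the same value can be logged for audit. Only an allowed decision, with its approved field list, should be passed on to query construction.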

Lineage should be captured at the API boundary and the execution boundary. The team needs to know which data product version served the response, which source tables contributed, which policy applied, and which downstream system consumed the result.
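The four facts listed above map naturally onto a single record per response. The sketch below assumes a hypothetical `LineageRecord` shape and JSON serialization; the field names are illustrative and do not follow any particular lineage standard.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    request_id: str
    product: str
    product_version: str
    source_tables: tuple
    policy_id: str
    consumer: str
    served_at: str


def lineage_record(request_id, product, product_version,
                   source_tables, policy_id, consumer, now=None):
    """Build one lineage record for a served response."""
    now = now or datetime.now(timezone.utc)
    return LineageRecord(
        request_id=request_id,
        product=product,
        product_version=product_version,
        # Sorted so that records for the same sources compare equal.
        source_tables=tuple(sorted(source_tables)),
        policy_id=policy_id,
        consumer=consumer,
        served_at=now.isoformat(),
    )


def to_json(record):
    """Serialize the record for an audit log or lineage store."""
    return json.dumps(asdict(record), sort_keys=True)
```

Emitting the record at the API boundary covers product version, policy, and consumer. The source table list would be filled in from whatever the execution layer reports it actually read.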

What breaks first

  • The API exposes raw SQL and accidentally recreates a shadow warehouse.
  • Policy is implemented only in application code and never reaches catalog, lineage, or audit systems.
  • Performance tuning pushes teams toward private caches with undocumented semantics.
  • Data product owners change schemas without versioning the API contract.
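The last failure mode can be caught mechanically. As a rough sketch, a release check could compare the published schema with the proposed one and flag changes that would break consumers. The schema representation (field name to type string) and the rule that added fields are non-breaking are assumptions made for illustration.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that would break consumers of the old schema."""
    problems = []
    for field, dtype in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != dtype:
            problems.append(
                f"type changed: {field} {dtype} -> {new_schema[field]}"
            )
    # Fields present only in new_schema are treated as additive (assumption).
    return problems


def requires_new_version(old_schema, new_schema):
    """True when the change should ship under a new API contract version."""
    return bool(breaking_changes(old_schema, new_schema))
```

Run in a release pipeline, a check like this turns an unversioned schema change into a failed build instead of a surprise for API consumers.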

Questions to ask

Use these questions when DataFusion sits behind a data product API.

  • Which API operations are allowed, and which data fields can each operation reach?
  • Which open table or file contracts supply the data?
  • Where are policy decisions evaluated and logged?
  • Can another engine read the same source data without changing the product contract?
  • Can a consumer trace an API response back to source tables and product version?

For adjacent patterns, read "Table Formats, Catalogs, and Query Engines", "Query Engines in ODI", and "Agentic Data Product Design".

DataFusion can be the execution engine. The architecture still has to define the product, policy, and lineage contracts.