A Reference Architecture for Open Data Infrastructure

A useful reference architecture for open data infrastructure has one job: make the important data contracts explicit enough that teams can change tools without losing control.

The architecture shape

Think in three planes.

Data plane: source access, storage, open table formats, files, streams, and physical movement.
Control plane: catalogs, metadata, identity, policy, credentials, lineage, observability, and table operations.
Application plane: query engines, transformation tools, BI, data products, AI agents, and operational applications.

The control plane is the part teams under-design. They buy storage and compute, then hope governance and metadata will appear later. They do not. The control plane is where portability becomes real, because it holds the table state, permissions, and metadata that every engine has to agree on.

Data plane

The data plane starts with reliable access to source systems. That may mean APIs, CDC, exports, queues, streams, or batch replication. The goal is not religious purity. The goal is a documented path from source systems into a governed analytical and AI-ready foundation.

Once data lands, table formats matter. Apache Iceberg is important because the table contract includes schema evolution, partition evolution, snapshots, manifests, deletes, and concurrency behavior. Those details are not trivia. They are what let multiple engines coordinate around the same data without guessing.
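To make the coordination point concrete, here is a minimal sketch of snapshot-based optimistic concurrency. It is a toy model, not Iceberg's implementation: the names ToyTable, Snapshot, and commit_append are invented for illustration, and a real Iceberg writer can often retry against the new table state instead of simply failing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: which files are visible, and its parent."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


@dataclass
class ToyTable:
    """Hypothetical in-memory table that only tracks its snapshot history."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def commit_append(self, base_id: Optional[int], new_files: List[str]) -> Snapshot:
        # The writer states which snapshot it planned against. If another
        # writer committed in the meantime, the commit is rejected.
        cur = self.current()
        cur_id = cur.snapshot_id if cur else None
        if base_id != cur_id:
            raise CommitConflict(f"base snapshot {base_id} != current {cur_id}")
        files = (cur.data_files if cur else ()) + tuple(new_files)
        snap = Snapshot(len(self.snapshots) + 1, cur_id, files)
        self.snapshots.append(snap)
        return snap


table = ToyTable()
s1 = table.commit_append(None, ["a.parquet"])
s2 = table.commit_append(s1.snapshot_id, ["b.parquet"])

try:
    # A second engine still planning against s1 is told its view is stale.
    table.commit_append(s1.snapshot_id, ["c.parquet"])
    conflicted = False
except CommitConflict:
    conflicted = True
```

The point of the sketch is that neither writer guesses. The table history itself says which commit wins, and older snapshots stay readable because they are never mutated.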

Control plane

The control plane coordinates what the data means and who can do what with it. At minimum, it includes a catalog, identity integration, access policy, lineage, quality signals, audit logs, and metadata APIs.

Catalogs are especially important because they sit between physical tables and compute engines. The Iceberg REST Catalog specification exists because pluggable catalogs created compatibility problems across languages, engines, and commercial offerings. A common protocol gives engines one client pattern for catalog interaction instead of a pile of bespoke integrations.
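As a rough illustration of what "one client pattern" means, the sketch below builds a few request URLs in the shape the REST Catalog specification describes, as I understand it: a /v1/config endpoint, an optional prefix, and multi-level namespaces joined with the unit separator character. The host catalog.example.com and the helper names are hypothetical, and no request is sent. Check the specification's OpenAPI definition before relying on any path.

```python
from typing import List
from urllib.parse import quote

# Assumption: multi-level namespaces are joined with the 0x1F unit separator,
# which is percent-encoded as %1F in the URL path.
NAMESPACE_SEPARATOR = "\x1f"


def _join(base_url: str, *parts: str) -> str:
    """Join path segments onto a base URL, skipping empty segments."""
    return "/".join([base_url.rstrip("/"), *[p for p in parts if p]])


def namespace_path(levels: List[str]) -> str:
    """Encode namespace levels as a single URL path segment."""
    return quote(NAMESPACE_SEPARATOR.join(levels), safe="")


def config_url(base_url: str) -> str:
    return _join(base_url, "v1", "config")


def list_tables_url(base_url: str, levels: List[str], prefix: str = "") -> str:
    return _join(base_url, "v1", prefix, "namespaces", namespace_path(levels), "tables")


def load_table_url(base_url: str, levels: List[str], table: str, prefix: str = "") -> str:
    return _join(list_tables_url(base_url, levels, prefix), quote(table, safe=""))


BASE = "https://catalog.example.com"  # hypothetical catalog endpoint
tables_url = list_tables_url(BASE, ["analytics", "sales"])
orders_url = load_table_url(BASE, ["analytics"], "orders", prefix="prod")
```

Any engine that speaks the protocol builds the same URLs, which is what replaces the pile of bespoke integrations.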

Practical test: if you change query engines, the data plane and control plane should still agree on table state, permissions, and metadata.
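One way to run that test is to record what each engine sees for the same table and diff the results. The sketch below assumes you can extract three facts per engine (current snapshot, schema version, and who may read); the EngineView type and its field names are hypothetical, not part of any real engine's API.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class EngineView:
    """What one engine reports about a table (illustrative fields only)."""
    engine: str
    current_snapshot_id: int
    schema_id: int
    readers: FrozenSet[str]


def disagreements(a: EngineView, b: EngineView) -> List[str]:
    """Return the names of the fields on which two engines disagree."""
    compared = ("current_snapshot_id", "schema_id", "readers")
    return [name for name in compared if getattr(a, name) != getattr(b, name)]


old_engine = EngineView("engine_a", 42, 3, frozenset({"analyst"}))
new_engine = EngineView("engine_b", 42, 3, frozenset({"analyst", "intern"}))

drift = disagreements(old_engine, new_engine)
```

Here the two engines agree on table state but not on permissions, which suggests access policy lives in an engine rather than in the control plane.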

Application plane

The application plane is where people feel the architecture. Analysts query data. Engineers build pipelines. Data products expose APIs. Agents retrieve context. Executives see metrics. If the lower layers are closed or ambiguous, this layer gets expensive fast.

The application plane should not own the core meaning of the data by accident. A BI tool can present metrics, but the metric definition should not become impossible to reuse. An AI agent can retrieve context, but it should not invent lineage because the platform failed to expose it.

An end-to-end open data infrastructure (ODI) flow

Source data enters through documented access paths.
Data lands in open table formats with table metadata and transaction history.
A catalog stores and serves table state, namespaces, credentials, and operations.
Governance services enforce access and record audit events.
Lineage and quality systems capture signals as work happens.
Compute engines read and write through the shared contracts.
AI systems retrieve data with policy, source, freshness, and meaning attached.

That is the architecture. Not a single product. Not a diagram with eight logos. A set of contracts that keep the organization in control.
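The last step of the flow, retrieval with policy, source, freshness, and meaning attached, can be sketched as a function that refuses to hand data to an agent without that context. Everything here is an assumption for illustration: GovernedDataset, retrieve_for_agent, the role names, and the sample table are invented, and a real platform would pull these facts from its catalog and policy services.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, FrozenSet, List


class StaleDataError(Exception):
    """Raised when the data is older than the caller's freshness budget."""


@dataclass(frozen=True)
class GovernedDataset:
    rows: List[Dict[str, Any]]
    source_table: str              # where the data came from
    committed_at: datetime         # when the snapshot was committed
    allowed_roles: FrozenSet[str]  # who may read it
    description: str               # what the data means


def retrieve_for_agent(
    dataset: GovernedDataset, role: str, now: datetime, max_age: timedelta
) -> Dict[str, Any]:
    """Return rows only together with their source, freshness, and meaning."""
    if role not in dataset.allowed_roles:
        raise PermissionError(f"role {role!r} may not read {dataset.source_table}")
    if now - dataset.committed_at > max_age:
        raise StaleDataError(f"{dataset.source_table} is older than {max_age}")
    return {
        "rows": dataset.rows,
        "source": dataset.source_table,
        "as_of": dataset.committed_at.isoformat(),
        "meaning": dataset.description,
    }


orders = GovernedDataset(
    rows=[{"order_id": 1, "net_revenue": 120.0}],
    source_table="analytics.orders",
    committed_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    allowed_roles=frozenset({"analyst"}),
    description="One row per order; net_revenue excludes tax.",
)
now = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
context = retrieve_for_agent(orders, "analyst", now, timedelta(hours=6))
```

The agent never sees bare rows. If the platform cannot supply the policy or the freshness, the call fails instead of letting the agent invent them.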

Sources to start with

Read the Stack
Catalogs as Control Plane
Use the Scorecard
