Open Data Infrastructure Stack

The open data infrastructure stack is not a shopping list. It is a control model for deciding which parts of your data platform must stay portable as the architecture changes.

The model

Most data stacks are drawn as boxes: ingestion, storage, warehouse, BI, AI, governance. Fine. Boxes are useful. They are also where architecture conversations go to nap.

The open data infrastructure (ODI) stack is more useful when you think in contracts. Which layer owns data access? Which layer owns table state? Which layer owns identity and policy? Which layer gives AI systems context? If those contracts are hidden inside one vendor, you can still build a working system. You just do not have much control when something changes.

These layers didn't appear all at once. They're what the evolution of open data infrastructure produced as each part broke free of the vendor's box.

Core idea: the ODI stack is a portability map. It shows where lock-in hides.
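
One way to make the portability map concrete is to record, for each contract, who holds it and whether it is reachable through an open interface. The sketch below is illustrative only: the contract names echo the questions above, while `VendorA`, `VendorB`, the `open_interface` flag, and `lock_in_hotspots` are hypothetical names, not part of any ODI specification.

```python
from collections import defaultdict

# Hypothetical inventory: contract -> who holds it and whether the interface is open.
# Vendor names and the open_interface flag are assumptions made for illustration.
CONTRACTS = {
    "data_access": {"vendor": "VendorA", "open_interface": True},
    "table_state": {"vendor": "VendorB", "open_interface": False},
    "identity_and_policy": {"vendor": "VendorB", "open_interface": False},
    "ai_context": {"vendor": "VendorB", "open_interface": False},
}


def lock_in_hotspots(contracts):
    """Group contracts that lack an open interface by the vendor holding them."""
    closed = defaultdict(list)
    for name, info in contracts.items():
        if not info["open_interface"]:
            closed[info["vendor"]].append(name)
    return {vendor: sorted(names) for vendor, names in closed.items()}


hotspots = lock_in_hotspots(CONTRACTS)
for vendor, names in hotspots.items():
    print(f"{vendor} holds {len(names)} closed contracts: {', '.join(names)}")
```

A vendor that shows up with several closed contracts is where a change in pricing, roadmap, or architecture would leave you with the least room to move.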

The six ODI layers

Access layer. Source systems expose data through APIs, CDC, exports, event streams, or replication paths that are documented and operationally durable.
Storage and table layer. Data lands in formats and table structures that preserve schema, partitioning, snapshots, deletes, and table history across engines.
Catalog layer. Catalogs coordinate table discovery, namespaces, metadata pointers, transactions, credentials, permissions, and engine interoperability.
Compute layer. Query engines, transformation engines, stream processors, and embedded analytics systems can work against shared data contracts instead of private copies.
Governance layer. Identity, policy, lineage, quality, observability, and auditability run inline with the work itself, not as a separate review step after the fact.
Context layer. AI agents and applications retrieve governed data with metadata, lineage, freshness, definitions, and policy attached.

What belongs where

Layer boundaries matter because teams often ask one component to do the job of another. That is how a catalog becomes a search box, a table format becomes a governance strategy, and an AI tool becomes an over-permissioned query runner (hooray, incident review).

Layer	Owns	Should not own
Access	Reliable movement from source systems.	Business semantics or downstream governance.
Storage and tables	Physical data, table state, snapshots, schema, and partition history.	Every policy decision in the enterprise.
Catalog	Table discovery, metadata pointers, operations, credentials, and policy coordination.	The full user experience for every data product.
Compute	Execution, query planning, transformation, and workload-specific optimization.	Exclusive ownership of data meaning.
Governance	Access control, lineage, quality, auditability, and trust signals.	Manual approval theater outside the workflow.
Context	Machine-usable meaning for applications and agents.	Guessing from raw table names and column names.
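
The ownership table can be treated as a lookup that flags a component doing another layer's job. This is a minimal sketch, assuming a simplified vocabulary of responsibilities paraphrased from the "Owns" column; the identifiers `LAYER_OWNS`, `Component`, and `misplaced_responsibilities` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Responsibilities per layer, paraphrased from the "Owns" column above.
# The snake_case labels are an assumption, not an official vocabulary.
LAYER_OWNS = {
    "access": {"source_movement"},
    "storage_and_tables": {"physical_data", "table_state", "snapshots",
                           "schema", "partition_history"},
    "catalog": {"table_discovery", "metadata_pointers", "operations",
                "credentials", "policy_coordination"},
    "compute": {"execution", "query_planning", "transformation",
                "workload_optimization"},
    "governance": {"access_control", "lineage", "quality",
                   "auditability", "trust_signals"},
    "context": {"machine_usable_meaning"},
}


@dataclass(frozen=True)
class Component:
    name: str
    layer: str
    responsibilities: frozenset


def misplaced_responsibilities(component):
    """Map each responsibility outside the component's layer to the layer that owns it.

    Responsibilities that no layer claims map to None.
    """
    owned = LAYER_OWNS[component.layer]
    misplaced = {}
    for duty in sorted(component.responsibilities - owned):
        owner = next(
            (layer for layer, duties in LAYER_OWNS.items() if duty in duties),
            None,
        )
        misplaced[duty] = owner
    return misplaced


# An AI tool that also runs queries and decides access: the over-permissioned
# query runner described above.
ai_tool = Component(
    name="ai_query_tool",
    layer="context",
    responsibilities=frozenset(
        {"machine_usable_meaning", "execution", "access_control"}
    ),
)
findings = misplaced_responsibilities(ai_tool)
```

Here `findings` points `execution` back to the compute layer and `access_control` back to governance, which is the conversation to have before the incident review rather than during it.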

The common mistakes

The first mistake is thinking open storage is enough. It is not. If the metadata, catalog, policies, and lineage are still trapped, the data is only partially open.

The second mistake is treating the catalog as a UI. Search matters, but the strategic value of a catalog is operational: table operations, metadata coordination, access patterns, credentials, and consistency across engines.

The third mistake is adding AI on top of a stack that cannot explain itself. Agents need context that is governed, traceable, and machine-readable. Otherwise, the model is just spelunking through your data estate with a flashlight and too much confidence.
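
To illustrate what "governed, traceable, and machine-readable" could look like in practice, the sketch below builds the payload an agent receives for a table only after a policy check, and attaches the definition, lineage, and freshness to it. Everything here is an assumption for illustration: `TableContext`, `context_for_agent`, the role names, and the 24-hour freshness threshold are hypothetical and do not describe any particular catalog or agent framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TableContext:
    table: str
    definition: str            # business meaning, not just a column list
    lineage: tuple             # upstream sources, in order
    last_updated: datetime     # timezone-aware timestamp of the last refresh
    allowed_roles: frozenset   # roles that policy permits to read this table


def context_for_agent(ctx, agent_role, now, max_age=timedelta(hours=24)):
    """Return a machine-readable context payload, or refuse if policy forbids it."""
    if agent_role not in ctx.allowed_roles:
        raise PermissionError(f"role {agent_role!r} may not read {ctx.table}")
    age = now - ctx.last_updated
    return {
        "table": ctx.table,
        "definition": ctx.definition,
        "lineage": list(ctx.lineage),
        "age_hours": round(age.total_seconds() / 3600, 1),
        "is_fresh": age <= max_age,
    }


orders = TableContext(
    table="sales.orders",
    definition="One row per confirmed customer order.",
    lineage=("orders_db.orders", "staging.orders_cleaned"),
    last_updated=datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
    allowed_roles=frozenset({"finance_agent"}),
)
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
payload = context_for_agent(orders, "finance_agent", now)
```

The point of the design is ordering: the policy check happens before any context is returned, and the agent never has to infer meaning or freshness from raw table and column names.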

Sources to start with

See the Reference Architecture, Catalogs as Control Plane, and Evaluate Your Stack.
