The Open Data Infrastructure Stack
A practical map of the layers that make enterprise data portable, governed, interoperable, and ready for AI systems.
The open data infrastructure (ODI) stack is not a shopping list. It is a control model for deciding which parts of your data platform must stay portable as the architecture changes.
The model
Most data stacks are drawn as boxes: ingestion, storage, warehouse, BI, AI, governance. Fine. Boxes are useful. They are also where architecture conversations go to nap.
The ODI stack is more useful when you think in contracts. Which layer owns data access? Which layer owns table state? Which layer owns identity and policy? Which layer gives AI systems context? If those contracts are hidden inside one vendor, you can still build a working system. You just do not have much control when something changes.
These layers didn't appear all at once. They are what the evolution of open data infrastructure left behind, one layer at a time, as each part broke free of the vendor's box.
Core idea: the ODI stack is a portability map. It shows where lock-in hides.
The six ODI layers
- Access layer. Source systems expose data through APIs, CDC, exports, event streams, or replication paths that are documented and operationally durable.
- Storage and table layer. Data lands in formats and table structures that preserve schema, partitioning, snapshots, deletes, and table history across engines.
- Catalog layer. Catalogs coordinate table discovery, namespaces, metadata pointers, transactions, credentials, permissions, and engine interoperability.
- Compute layer. Query engines, transformation engines, stream processors, and embedded analytics systems can work against shared data contracts instead of private copies.
- Governance layer. Identity, policy, lineage, quality, observability, and auditability are enforced and recorded inside the workflow itself, not reviewed after the fact.
- Context layer. AI agents and applications retrieve governed data with metadata, lineage, freshness, definitions, and policy attached.
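One way to make the portability map concrete is to write it down as data. The sketch below is illustrative only: the layer names come from the list above, but `LayerContract`, `open_interface`, and `lock_in_report` are hypothetical names, and "open" is reduced to a single yes/no flag that a real assessment would need to unpack.

```python
from dataclasses import dataclass

# Layer names taken from the list above; everything else is an assumption.
LAYERS = ("access", "storage_and_table", "catalog", "compute", "governance", "context")


@dataclass(frozen=True)
class LayerContract:
    layer: str
    provider: str         # whoever implements the layer today (illustrative)
    open_interface: bool  # True if the contract is documented and engine-neutral


def lock_in_report(contracts):
    """Return layers with no contract at all and layers whose contract is closed."""
    by_layer = {c.layer: c for c in contracts}
    missing = [name for name in LAYERS if name not in by_layer]
    closed = [
        name for name in LAYERS
        if name in by_layer and not by_layer[name].open_interface
    ]
    return {"missing": missing, "closed": closed}
```

Feeding it an inventory where the catalog and governance contracts live inside one vendor, and nobody owns context yet, would flag exactly those three layers. That is the list of places to look for lock-in first.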
What belongs where
Layer boundaries matter because teams often ask one component to do the job of another. That is how a catalog becomes a search box, a table format becomes a governance strategy, and an AI tool becomes an over-permissioned query runner (hooray, incident review).
| Layer | Owns | Should not own |
|---|---|---|
| Access | Reliable movement from source systems. | Business semantics or downstream governance. |
| Storage and tables | Physical data, table state, snapshots, schema, and partition history. | Every policy decision in the enterprise. |
| Catalog | Table discovery, metadata pointers, table operations, credentials, and policy coordination. | The full user experience for every data product. |
| Compute | Execution, query planning, transformation, and workload-specific optimization. | Exclusive ownership of data meaning. |
| Governance | Access control, lineage, quality, auditability, and trust signals. | Manual approval theater outside the workflow. |
| Context | Machine-usable meaning for applications and agents. | Guessing from raw table names and column names. |
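The "Owns" column can be turned into a rough boundary check. This is a minimal sketch, assuming each responsibility is reduced to a short label; the labels, the `OWNS` mapping, and the `misplaced` function are hypothetical names chosen for illustration, and the mapping is a simplification of the table rather than a complete taxonomy.

```python
# Simplified labels derived from the "Owns" column of the table above.
OWNS = {
    "access": {"source_movement"},
    "storage_and_tables": {
        "physical_data", "table_state", "snapshots", "schema", "partition_history",
    },
    "catalog": {
        "table_discovery", "metadata_pointers", "table_operations",
        "credentials", "policy_coordination",
    },
    "compute": {"execution", "query_planning", "transformation"},
    "governance": {"access_control", "lineage", "quality", "auditability"},
    "context": {"machine_usable_meaning"},
}


def misplaced(layer, responsibilities):
    """Map each responsibility that belongs to a different layer to its owning layer."""
    owner_of = {label: name for name, labels in OWNS.items() for label in labels}
    return {
        label: owner_of[label]
        for label in responsibilities
        if label in owner_of and owner_of[label] != layer
    }
```

Asking a table format to handle `access_control`, for example, would come back as a responsibility owned by `governance`: the "table format becomes a governance strategy" problem, caught before the incident review.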
The common mistakes
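The first mistake is thinking open storage is enough. It is not. If the metadata, catalog, policies, and lineage are still trapped in a proprietary system, the data is only partially open: you can read the files, but the table state, permissions, and history do not move with them.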
The second mistake is treating the catalog as a UI. Search matters, but the strategic value of a catalog is operational: table operations, metadata coordination, access patterns, credentials, and consistency across engines.
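To show what "operational" means here, the sketch below models a catalog as the owner of each table's metadata pointer, with a compare-and-swap commit so two engines cannot silently overwrite each other. It is a toy in-memory model under stated assumptions: `InMemoryCatalog`, `CommitConflict`, and the metadata paths are hypothetical, and this is not the API of any real catalog.

```python
class CommitConflict(Exception):
    """Raised when a writer commits against a stale metadata pointer."""


class InMemoryCatalog:
    def __init__(self):
        # (namespace, table) -> location of the current table metadata
        self._pointers = {}

    def create_table(self, namespace, table, metadata_location):
        key = (namespace, table)
        if key in self._pointers:
            raise ValueError(f"table already exists: {namespace}.{table}")
        self._pointers[key] = metadata_location

    def load_table(self, namespace, table):
        """Return the current metadata pointer; raises KeyError if unknown."""
        return self._pointers[(namespace, table)]

    def commit(self, namespace, table, expected, new):
        """Swap the pointer only if the caller saw the latest table state."""
        key = (namespace, table)
        current = self._pointers[key]
        if current != expected:
            raise CommitConflict(f"expected {expected!r}, found {current!r}")
        self._pointers[key] = new
```

If one engine commits `meta/v2.json` on top of `meta/v1.json`, a second engine still holding `meta/v1.json` gets a `CommitConflict` and has to reload and retry. No search box does that for you.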
The third mistake is adding AI on top of a stack that cannot explain itself. Agents need context that is governed, traceable, and machine-readable. Otherwise, the model is just spelunking through your data estate with a flashlight and too much confidence.
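As a rough illustration of governed, traceable, machine-readable context, the sketch below attaches definitions, lineage, freshness, and policy to a table reference and checks them before an agent sees anything. All names (`ContextEnvelope`, `retrieve_for_agent`, the role and table names) are hypothetical, and the role check stands in for whatever policy engine the governance layer actually provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextEnvelope:
    table: str
    definitions: dict         # column name -> business definition
    lineage: tuple            # upstream table names
    refreshed_at: datetime    # timezone-aware time of last refresh
    allowed_roles: frozenset  # roles permitted to read this table


def retrieve_for_agent(envelope, agent_role, max_age, now=None):
    """Return machine-readable context, or refuse if policy or freshness fails."""
    now = now or datetime.now(timezone.utc)
    if agent_role not in envelope.allowed_roles:
        raise PermissionError(f"role {agent_role!r} may not read {envelope.table}")
    age = now - envelope.refreshed_at
    if age > max_age:
        raise ValueError(f"{envelope.table} is stale: last refreshed {age} ago")
    return {
        "table": envelope.table,
        "definitions": dict(envelope.definitions),
        "lineage": list(envelope.lineage),
        "refreshed_at": envelope.refreshed_at.isoformat(),
    }
```

The point of the design is that the agent never receives a bare table name. It receives the table together with what the columns mean, where the data came from, and how old it is, and only after the policy check passes.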