The Evolution of Open Data Infrastructure

Your data is yours. Or is it?

For most of data platform history, that wasn't true in any practical sense. Your data lived in formats only one vendor could read. You reached it through connectors only one vendor controlled. You queried it with engines only one vendor operated. Moving it somewhere else cost a small fortune in egress fees. You owned the data the way you own a song on a platform that can revoke it.

Open Data Infrastructure is the story of how we clawed that ownership back, one decoupling at a time. It happened in stages, and nobody planned the destination. Each stage pulled one more layer out of the vendor's locked box. Put the stages end to end and they point at the same place, which is the architecture AI turned out to need.

This is the history. If you want the definition and the buyer's lens first, start with the Open Data Infrastructure guide and the breakdown of what open data infrastructure actually means. If you want to understand how we got here, keep reading.

The monolith locked everything in one box

The relational database started as one tightly coupled system, and for good reason. ACID transactions (atomicity, consistency, isolation, durability, the guarantees that keep your bank balance correct) need sub-millisecond coordination between the buffer pool, the lock manager, the write-ahead log, and the storage engine. You can't spread those parts across a network and still keep the guarantees. Physics, not vendor greed, made the early database a monolith.

Two storage designs grew out of that constraint: B+ trees, tuned for fast point lookups, and LSM-trees, which turn random writes into sequential ones so disks can keep up. Both are good at transactions. Neither is good at analytics when the data is laid out row by row, because a row-oriented table reads the whole row even when you asked for one column, then throws away everything you didn't want. (Looking at you, every dashboard query that touches three columns out of a hundred.)

The deeper problem was ownership. Storage, compute, metadata, and the data itself all sat inside one product, controlled by one vendor. You didn't have data. You had a subscription to your data.

Decoupling storage from compute bought scale, not ownership

Cloud data warehouses made two bets that changed the shape of the industry. Columnar storage (hooray!) and compute disaggregated from storage (double hooray!!). Store each column together and you can compress it hard and read only what a query touches. Split compute from storage and you can scale them independently, paying for compute only while a query runs.
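
To put a rough number on "read only what a query touches," here is a back-of-envelope sketch. It assumes a hypothetical table of one million rows, a hundred fixed-width 8-byte columns, and a query that needs three of them. It ignores compression, indexes, and page overhead, so treat the output as an illustration of the ratio, not a benchmark of any real engine.

```python
# Toy model: bytes scanned by a query that needs 3 columns out of 100.
# Every size below is an illustrative assumption, not a measurement.

def row_layout_bytes(num_rows: int, num_cols: int, bytes_per_value: int) -> int:
    """Rows are stored contiguously, so a scan reads every column of every row."""
    return num_rows * num_cols * bytes_per_value


def column_layout_bytes(num_rows: int, cols_needed: int, bytes_per_value: int) -> int:
    """Columns are stored contiguously, so a scan reads only the requested columns."""
    return num_rows * cols_needed * bytes_per_value


rows, cols, needed, width = 1_000_000, 100, 3, 8
row_bytes = row_layout_bytes(rows, cols, width)
col_bytes = column_layout_bytes(rows, needed, width)

print(f"row layout:    {row_bytes / 1e6:.0f} MB scanned")     # 800 MB
print(f"column layout: {col_bytes / 1e6:.0f} MB scanned")     # 24 MB
print(f"read and discarded by the row layout: {1 - col_bytes / row_bytes:.0%}")  # 97%
```

Compression widens the gap further, because values within one column tend to look alike.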

Snowflake's 2016 SIGMOD paper laid out the model: immutable micro-partitions in a proprietary columnar format, with compute clusters spun up against shared storage. Google BigQuery went serverless over its Colossus file system, storing data in a proprietary format called Capacitor. Both were real engineering leaps, and for most analytical workloads they solved the scaling problem.

Your data still wasn't yours. The files sat in formats no outside engine could open. You'd separated storage from compute inside a box you still didn't hold the key to. Better box. Same lock.

Decoupling the data bought ownership, and a swamp

The data lake flipped the order of operations. Put the raw files in open formats on commodity object storage first, then decide which engines read them. Apache Parquet, the open columnar file format that does dictionary, run-length, and delta encoding (often hitting 5 to 10x compression, depending on the data), meant any engine could read the bytes. Column pruning (skipping columns the query never mentions) and predicate pushdown (skipping chunks whose statistics rule out a match) meant it could read them efficiently.
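
The three encodings named above are simple ideas, and a toy version makes them concrete. This is a minimal sketch in plain Python, not Parquet's actual on-disk encoders (the real format combines run-length encoding with bit-packing and uses a block-based delta scheme). The function names and sample values are made up for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a table of distinct values."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    return list(table), codes


def run_length_encode(values):
    """Collapse each run of repeated values into a (value, run_length) pair."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store only the difference to the previous value."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


countries = ["US", "US", "US", "DE", "DE", "US"]
table, codes = dictionary_encode(countries)
print(table, codes)              # ['US', 'DE'] [0, 0, 0, 1, 1, 0]
print(run_length_encode(codes))  # [(0, 3), (1, 2), (0, 1)]

timestamps = [1000, 1005, 1010, 1015]
print(delta_encode(timestamps))  # [1000, 5, 5, 5]
```

Each encoding leans on the same property: values in one column resemble each other far more than values across one row do.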

For the first time, ownership moved to the data owner. Multiple engines could query the same files. Nobody else held the key.

Then the guarantees fell apart. An ETL job that crashed halfway through writing a thousand Parquet files left five hundred orphans behind, with no transaction to roll back. No schema enforcement. No atomic commits. No safe concurrent writes. The industry coined a name for what the lake became once nobody minded the contracts. The data swamp.

Core idea: Warehouses had guarantees without ownership. Lakes had ownership without guarantees. Neither one was enough on its own.

Decoupling metadata finally bought both

The lakehouse closed the gap with a thin, open metadata layer sitting on top of the same lake files. That layer brought back ACID transactions, schema enforcement, and time travel, without giving up open storage or open access. The trick is a manifest that records which files belong to which version of a table, committed atomically. A writer prepares its new files, then tries to swap the table's pointer to a new manifest. If another writer got there first, the swap fails and the writer retries. In the standard phrasing, that's optimistic concurrency control with atomic metadata commits.
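
The commit protocol is easier to see in code than in prose. Below is a minimal in-memory sketch of optimistic concurrency over a versioned manifest. `ToyCatalog`, `Snapshot`, and `CommitConflict` are hypothetical names invented here, and a `threading.Lock` stands in for whatever atomic compare-and-swap a real catalog provides. It is not how Iceberg, Delta Lake, or Hudi implement commits internally.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: tuple  # data files that make up the table at this version


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Holds one pointer to the current snapshot; swapping it is the atomic commit."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for an atomic compare-and-swap
        self._current = Snapshot(version=0, files=())

    def current(self) -> Snapshot:
        return self._current

    def commit(self, base: Snapshot, new_files) -> Snapshot:
        with self._lock:
            # Optimistic check: only commit if nobody moved the pointer since we read it.
            if base.version != self._current.version:
                raise CommitConflict(
                    f"base v{base.version} is stale, table is at v{self._current.version}"
                )
            self._current = Snapshot(base.version + 1, base.files + tuple(new_files))
            return self._current


catalog = ToyCatalog()
base = catalog.current()                     # two writers both start from v0
catalog.commit(base, ["part-0001.parquet"])  # writer A wins, table is now v1

try:
    catalog.commit(base, ["part-0002.parquet"])  # writer B still holds v0
except CommitConflict:
    # Writer B re-reads the current snapshot and retries on top of it.
    catalog.commit(catalog.current(), ["part-0002.parquet"])

print(catalog.current())
```

Notice what happens to a job that crashes before calling `commit`: its files exist in storage, but no snapshot references them, so readers never see a half-written table.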

Three open table formats do this today: Apache Iceberg, Delta Lake, and Apache Hudi. A table format is the contract that tells an engine what a "table" actually is on top of a pile of files, including its schema, partitions, snapshots, and delete behavior. As of 2026, Apache Iceberg has the broadest cross-vendor adoption of the three. Snowflake donated its Polaris catalog to the Apache Software Foundation. AWS shipped S3 Tables with Iceberg built in. Google added BigLake Iceberg tables. When three of the largest competing platforms all support the same format, that format has stopped being a bet.

The catalog is where this gets real. The Iceberg REST Catalog specification, an OpenAPI HTTP interface for table discovery, namespaces, and operations, turns catalog integration from an N times M problem (every engine wired to every catalog by hand) into an N plus M one. That's the layer where openness either holds or quietly breaks, which is why catalogs are becoming the control plane for the whole architecture.
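
The N times M versus N plus M arithmetic is worth running once. The sketch below just counts integrations under each model. The engine and catalog counts are made-up inputs, and the function names are hypothetical.

```python
def point_to_point(engines: int, catalogs: int) -> int:
    """Every engine needs its own hand-built integration with every catalog."""
    return engines * catalogs


def shared_spec(engines: int, catalogs: int) -> int:
    """Each engine and each catalog implements the shared interface once."""
    return engines + catalogs


for n, m in [(3, 3), (5, 4), (10, 6)]:
    print(
        f"{n} engines x {m} catalogs: "
        f"{point_to_point(n, m)} custom integrations vs "
        f"{shared_spec(n, m)} spec implementations"
    )
# 3 engines x 3 catalogs: 9 custom integrations vs 6 spec implementations
# 5 engines x 4 catalogs: 20 custom integrations vs 9 spec implementations
# 10 engines x 6 catalogs: 60 custom integrations vs 16 spec implementations
```

The gap grows with the ecosystem, which is the point: a shared spec gets cheaper per participant as more engines and catalogs show up.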

Open standards spread to every other layer

Storage and table formats opened first because the pain was loudest there. The same pressure hit every other layer, and the same move worked each time.

Apache Arrow standardized how engines hold columnar data in memory, so two systems can share data without serializing and deserializing it on the way through. Arrow Flight SQL pushes that to the network, with published benchmarks reporting 20x or better throughput over ODBC (the 1990s connectivity standard most tools still lean on). ADBC does for database connectivity what Arrow did for memory. Substrait gives engines a shared way to describe query plans across languages. dbt brought version control, testing, and code review to transformation, the same engineering discipline the rest of software already had. And standardized ingestion replaced the bespoke connector you used to rebuild for every SaaS app, database, and event stream.

Same move, layer after layer. Pull the contract out of the vendor's box and write it down somewhere everyone can read it. Open formats changed the data architecture from the storage up.

AI didn't change the requirements, it raised the stakes

None of this was built for AI. The open, composable architecture just turned out to be exactly what AI workloads need, and that's structural, not luck.

Training reads data hard. A single ML run streams millions of micro-batches per epoch at sustained throughput. With open formats, the cluster reads Parquet straight from object storage into Arrow memory arrays and skips the database overhead entirely. Closed storage that forces every read through one vendor's API becomes the bottleneck the moment the GPUs get hungry.

Reproducibility comes nearly free from the metadata layer. Every write creates an immutable snapshot with a unique ID and a timestamp. Tag a snapshot before training, and months later you can reproduce the exact run on the exact data, which is the difference between an experiment and a guess. Tools like PyIceberg, Ray Data, and Feast read these tables directly, no proprietary loader required.
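
Here is a minimal sketch of the tag-then-reproduce idea, using a toy in-memory table in place of a real table format. `ToyTable`, its methods, and the file names are hypothetical and exist only for illustration. A real project would use its table format's own snapshot and tagging features, and real snapshot IDs are not simple counters.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: datetime
    files: tuple  # the complete file list for the table at this snapshot


class ToyTable:
    def __init__(self):
        self.snapshots = {}    # snapshot_id -> Snapshot, never mutated after creation
        self.tags = {}         # human-readable tag -> snapshot_id
        self.current_id = None

    def append(self, new_files) -> int:
        """Every write creates a new immutable snapshot; older ones stay readable."""
        previous = self.snapshots[self.current_id].files if self.current_id else ()
        next_id = len(self.snapshots) + 1
        self.snapshots[next_id] = Snapshot(
            next_id, datetime.now(timezone.utc), previous + tuple(new_files)
        )
        self.current_id = next_id
        return next_id

    def tag(self, name: str) -> None:
        """Pin a name to whatever snapshot is current right now."""
        self.tags[name] = self.current_id

    def scan(self, tag: str = None) -> tuple:
        """Read the current snapshot, or the pinned one if a tag is given."""
        snapshot_id = self.tags[tag] if tag else self.current_id
        return self.snapshots[snapshot_id].files


table = ToyTable()
table.append(["events-day1.parquet"])
table.tag("train-run-1")               # pin the training data before the run starts
table.append(["events-day2.parquet"])  # the table keeps changing afterwards

print(table.scan("train-run-1"))  # ('events-day1.parquet',)
print(table.scan())               # ('events-day1.parquet', 'events-day2.parquet')
```

The rerun months later reads through the tag, so it sees the same files the original run saw, no matter how many writes have landed since.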

Then there's pace. The AI stack changes faster than anything in recent memory. Infrastructure built on open standards lets you swap a query engine, a serving layer, or a vendor without migrating the data or rewriting the pipelines underneath. That's why AI-ready data infrastructure can't mean "we pointed a chatbot at the warehouse." It means access, governance, lineage, and context that travel with the data, which is the whole argument for why open data infrastructure matters in the AI era.

The principle outlasts the tools

Open is a design principle, not a product you buy. Your data is yours. Reach it through open APIs. Store it in open formats. Govern it with open metadata. Transform it with code you control. Query it with any engine you choose.

The tools that express this best today (Iceberg, dbt, Arrow, open connector protocols) will change. New ones will show up and some of these will fade. The principle holds because its value is structural. Flexibility, interoperability, robustness, and ownership aren't features on a vendor's roadmap. They're properties of the architecture itself.

AI needs access, flexibility, and interoperability. The people who own the data need ownership without lock-in. This is the moment those two needs finally converged, and the name for the place they meet is Open Data Infrastructure.

