Open Data Infrastructure
Running the Open Lakehouse in Production
A lakehouse is not "open" because it uses open formats. It is open when it stays operable and governable after the second engine, the first incident, and the first compliance audit.
The hardest part of the open lakehouse is not the architecture diagram. It is the 3 a.m. incident where you have to prove what happened and roll it back.
Why production breaks the dream
In a demo, the open lakehouse looks like a simple stack: object storage, a table format, a catalog, and an engine. In production, it becomes a contract between teams. The moment you have multiple writers, multiple engines, or real governance, every implicit assumption turns into an outage.
Core idea: operational portability is what makes open storage valuable. If you cannot operate it safely, you cannot keep it open.
If your team is still working on basic definitions, start with "What Is a Lakehouse, Really?" and "Table Format vs Catalog vs Query Engine."
Define ownership boundaries
Before you tune performance, decide who owns what. A credible production posture answers these questions:
- Who owns the catalog boundary (identity, policies, audit logs, and table operations)?
- Who owns storage layout decisions and retention policy?
- Who owns table contracts (schema evolution, partition evolution, delete semantics, and naming conventions)?
- Who is on call for data freshness incidents and correctness incidents?
If the answer is "the platform team owns it all," you will get shadow systems. If the answer is "each team does what it wants," you will get inconsistent contracts and unreliable data products. The right answer is explicit boundaries with clear escalation paths.
Make observability a first-class feature
Production lakehouses fail in ways you cannot see if you only monitor query latency. You need observability that can answer: which job wrote which data, using which logic, at what time, with which inputs, and which downstream assets were affected?
Lineage and operational metadata are part of the contract, not optional frosting. If you are integrating lineage, OpenLineage is a good starting point for standardizing event capture across tools; see the OpenLineage documentation.
At minimum, standardize:
- table-level write audit (who wrote, when, what changed)
- freshness and lag signals for critical datasets
- data quality checks tied to incident response, not dashboards
- cost and resource consumption per workload
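The first two signals in that list can be sketched as plain data structures. This is a minimal illustration under stated assumptions, not a lineage implementation: `WriteAuditRecord`, `latest_write`, `is_stale`, and every field name are hypothetical and do not mirror the OpenLineage event schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class WriteAuditRecord:
    """One table-level write: who wrote, when, what changed (hypothetical shape)."""
    table: str
    writer: str               # job or service identity that performed the write
    code_version: str         # which logic produced the data, e.g. a commit hash
    committed_at: datetime    # timezone-aware commit time
    inputs: tuple = ()        # upstream tables read by the job
    rows_added: int = 0
    rows_deleted: int = 0


def latest_write(records, table):
    """Return the most recent audit record for a table, or None if never written."""
    matching = [r for r in records if r.table == table]
    return max(matching, key=lambda r: r.committed_at, default=None)


def is_stale(records, table, max_lag, now):
    """A table is stale if it has no recorded write within max_lag of now."""
    last = latest_write(records, table)
    return last is None or (now - last.committed_at) > max_lag


# Usage: flag a critical dataset whose last write is older than one hour.
now = datetime.now(timezone.utc)
records = [
    WriteAuditRecord("sales.orders", "orders_etl", "a1b2c3", now - timedelta(hours=3)),
]
if is_stale(records, "sales.orders", timedelta(hours=1), now):
    print("sales.orders is stale; page the owning team")
```

Because each record carries the writer, the code version, and the inputs, the same records that drive the freshness alert can also answer "which job wrote this, using which logic" during an incident.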
Treat table changes like releases
Most lakehouse incidents start as "a small change" that was not treated as a release. Table formats support evolution, but your organization still needs discipline.
Release practices that work:
- version schemas intentionally, and document breaking changes
- use safe promotion workflows (branching or write-audit-publish patterns) when available
- define a rollback path for every high-risk write
- keep table maintenance automated (compaction, snapshot expiration, orphan cleanup)
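The write-audit-publish and rollback practices above can be illustrated with a toy in-memory table. This is a sketch of the pattern only: the `Table` class, its methods, and the `no_null_ids` check are hypothetical, and real table formats implement staging and rollback through their own snapshot or branch mechanisms.

```python
class Table:
    """Toy in-memory table with immutable snapshots and a 'current' pointer."""

    def __init__(self):
        self.snapshots = {0: ()}   # snapshot id -> tuple of rows
        self.current = 0           # the snapshot that readers see
        self.history = [0]         # published snapshot ids, oldest first
        self._next_id = 1

    def stage(self, new_rows):
        """Write: create a snapshot that readers cannot see yet."""
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = self.snapshots[self.current] + tuple(new_rows)
        return snapshot_id

    def publish(self, snapshot_id, audit):
        """Audit, then publish: move the pointer only if the check passes."""
        if not audit(self.snapshots[snapshot_id]):
            return False
        self.current = snapshot_id
        self.history.append(snapshot_id)
        return True

    def rollback(self):
        """Rollback path: repoint readers at the previously published snapshot."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier snapshot to roll back to")
        self.history.pop()
        self.current = self.history[-1]

    def read(self):
        return list(self.snapshots[self.current])


def no_null_ids(rows):
    """Example audit: every row must carry a non-null 'id'."""
    return all(row.get("id") is not None for row in rows)


# Usage: a bad write is staged but never becomes visible to readers.
table = Table()
staged = table.stage([{"id": None, "amount": 10}])
if not table.publish(staged, no_null_ids):
    print("audit failed; snapshot", staged, "was not published")
```

The design choice worth noting is that publishing and rolling back are both pointer moves over immutable snapshots, which is what makes a rollback path cheap enough to define for every high-risk write.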
If you are dealing with delete-heavy workloads, do not skip the maintenance story. See "Automating Table Maintenance and Compaction" and "Deletes, Updates, and CDC."
Put governance in the data path
Governance fails when it lives outside the system doing the work. A production lakehouse needs policy enforcement that happens where data is accessed, not only where data is discovered.
In practical terms, this means:
- identity is consistent across engines and services
- policy is evaluated for reads and writes, not only for UI clicks
- audit logs are preserved and queryable
- data sharing is designed as an explicit boundary
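One way to picture enforcement in the data path is a single authorization gate that every engine calls before it touches a table, and that records the decision either way. The sketch below is illustrative only: `Grant`, `PolicyGate`, and the engine labels are hypothetical names, not the API of any particular catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    """A principal's allowed actions on one table (hypothetical shape)."""
    principal: str
    table: str
    actions: frozenset  # subset of {"read", "write"}


class PolicyGate:
    """Evaluates policy for reads and writes and keeps a queryable audit log."""

    def __init__(self, grants):
        self._grants = list(grants)
        self.audit_log = []

    def authorize(self, principal, action, table, engine):
        allowed = any(
            g.principal == principal and g.table == table and action in g.actions
            for g in self._grants
        )
        # Log every decision, including denials, before acting on it.
        self.audit_log.append({
            "at": datetime.now(timezone.utc),
            "principal": principal,
            "action": action,
            "table": table,
            "engine": engine,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{principal} may not {action} {table}")


# Usage: the same identity and the same rules apply regardless of engine.
gate = PolicyGate([Grant("analyst", "sales.orders", frozenset({"read"}))])
gate.authorize("analyst", "read", "sales.orders", engine="engine-a")
try:
    gate.authorize("analyst", "write", "sales.orders", engine="engine-b")
except PermissionError as err:
    print("denied:", err)
```

The point of the sketch is the placement of the check: it runs on the access path itself, so adding a second engine does not create a second, weaker policy surface.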
If your governance model is "we trust the warehouse UI," you will struggle when you add a second engine or an agent layer. See "ODI for AI Agents" for why.
Plan for cost and performance drift
Compute-storage separation changes the cost shape, but it also makes cost drift easier to hide. Production operations should include:
- workload-level cost allocation and anomaly detection
- file size and partition hygiene tracking (the table format does not save you from bad writes)
- regular performance regression tests for critical queries
- capacity planning for metadata growth (catalog and manifest scale is a real constraint)
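File size hygiene is the easiest of these to turn into a number. The sketch below assumes you can list data files as `(partition, size_bytes)` pairs from your table metadata. The 32 MiB threshold, the 0.5 ratio, and the function names are illustrative assumptions, not recommendations from any table format.

```python
from collections import defaultdict

SMALL_FILE_BYTES = 32 * 1024 * 1024  # assumed threshold; tune per table


def small_file_ratio_by_partition(files, threshold=SMALL_FILE_BYTES):
    """files: iterable of (partition, size_bytes). Returns {partition: ratio}."""
    total = defaultdict(int)
    small = defaultdict(int)
    for partition, size_bytes in files:
        total[partition] += 1
        if size_bytes < threshold:
            small[partition] += 1
    return {p: small[p] / total[p] for p in total}


def partitions_needing_compaction(files, max_ratio=0.5, threshold=SMALL_FILE_BYTES):
    """Partitions where more than max_ratio of the files are under the threshold."""
    ratios = small_file_ratio_by_partition(files, threshold)
    return sorted(p for p, r in ratios.items() if r > max_ratio)


# Usage: a streaming writer left one partition full of tiny files.
MIB = 1024 * 1024
files = [
    ("day=2024-01-01", 1 * MIB),
    ("day=2024-01-01", 2 * MIB),
    ("day=2024-01-01", 256 * MIB),
    ("day=2024-01-02", 128 * MIB),
    ("day=2024-01-02", 200 * MIB),
]
print(partitions_needing_compaction(files))
```

Tracking this ratio over time, rather than checking it once, is what exposes drift: a partition that crosses the threshold after a writer change is a regression signal in its own right.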
If you treat the open lakehouse as "cheap storage plus SQL," you will trade the low bills you expected for expensive incidents.
A production checklist
- Clear ownership for catalog, storage, and table contracts
- Lineage and audit events standardized across critical jobs
- Automated maintenance jobs with on-call escalation
- Defined rollback procedures for high-risk writes
- Policy enforcement in the data path, not only the UI
- Disaster recovery drills that include catalog and metadata recovery
The open lakehouse becomes production-ready when your incident response feels boring.
Sources to start with
Start with the table format and catalog specifications, then validate behavior in the engines you actually run.