Open Data Infrastructure
SLAs and Reliability for Open Data Platforms
Data SLAs (service level agreements) are not only about uptime. They are about freshness, correctness, access, cost, and recoverability across a multi-engine stack.
A data platform is reliable when incidents are rare, small, and explainable. Everything else is just a dashboard.
Why data SLAs are different
Application SLAs often reduce to availability and latency. Data SLAs are multidimensional. A dataset can be "up" and still be useless if it is stale, incomplete, or wrong. It can be correct and still be unusable if access fails. It can be fast and still be dangerous if nobody can explain lineage and ownership.
Core idea: reliability in open data infrastructure (ODI) means the system stays trustworthy while components and engines change.
What you can actually promise
Strong SLAs are explicit about which promises are guaranteed and which are best-effort. Common SLA categories:
- freshness: data arrives within a defined time window
- completeness: expected records or partitions are present
- correctness: business rules and invariants hold
- access: authorized users and services can read and write reliably
- recoverability: rollback and restoration procedures meet recovery point objective (RPO) and recovery time objective (RTO) targets
- cost guardrails: runaway workloads are detected and contained
In an open lakehouse, these promises span storage, catalog, engines, orchestration, and governance. If your SLA only covers one layer, it is not a platform SLA.
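One way to make the guaranteed versus best-effort distinction explicit is to write each promise down as data and check which layers it covers. The following is a minimal sketch, not a standard format: the `Promise` and `Commitment` types, the `uncovered_layers` helper, the layer labels, and the example targets are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Commitment(Enum):
    GUARANTEED = "guaranteed"
    BEST_EFFORT = "best-effort"


@dataclass(frozen=True)
class Promise:
    category: str           # SLA category, e.g. "freshness" or "access"
    target: str             # human-readable target (illustrative only)
    layer: str              # platform layer the promise depends on
    commitment: Commitment  # guaranteed or best-effort


def uncovered_layers(promises, required_layers):
    """Return the required layers that no guaranteed promise covers."""
    covered = {p.layer for p in promises if p.commitment is Commitment.GUARANTEED}
    return sorted(set(required_layers) - covered)


# Layers named in the text; the labels themselves are an assumption.
LAYERS = ["storage", "catalog", "engines", "orchestration", "governance"]

# Hypothetical SLA for an "orders" dataset.
orders_sla = [
    Promise("freshness", "new partitions within 60 minutes",
            "orchestration", Commitment.GUARANTEED),
    Promise("access", "authorized reads succeed",
            "catalog", Commitment.GUARANTEED),
    Promise("cost guardrails", "runaway queries are stopped",
            "engines", Commitment.BEST_EFFORT),
]

print(uncovered_layers(orders_sla, LAYERS))
# ['engines', 'governance', 'storage']
```

A non-empty result is a quick signal that the SLA covers only part of the stack and is not yet a platform SLA.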
Useful SLIs for a lakehouse
Practical SLIs (service level indicators) that map to real pain:
- freshness lag: time since last successful partition or snapshot update
- pipeline success rate: per-job success, retries, and time-to-recovery
- data contract violations: schema drift, null rate spikes, uniqueness breaks
- access failures: auth errors, permission denials, credential expiration incidents
- metadata health: file counts, manifest counts, snapshot counts, and catalog latency
- cost per workload: compute spend, storage growth, and anomaly detection
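Two of the SLIs above, freshness lag and data contract violations, can be sketched in a few lines. This is an illustrative sketch that assumes rows are available as plain dictionaries and that the last successful update time is already known; the function names, the `max_null_rate` threshold, and the sample rows are hypothetical.

```python
from datetime import datetime, timezone


def freshness_lag(last_success, now=None):
    """Time elapsed since the last successful partition or snapshot update."""
    now = now or datetime.now(timezone.utc)
    return now - last_success


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def contract_violations(rows, expected_columns, key, max_null_rate):
    """Check for schema drift, null rate spikes, and uniqueness breaks."""
    violations = []

    # Schema drift: columns that appeared or disappeared versus the contract.
    seen_columns = set().union(*(row.keys() for row in rows))
    drift = seen_columns ^ set(expected_columns)
    if drift:
        violations.append(f"schema drift: {sorted(drift)}")

    # Null rate spikes: any expected column above the allowed null rate.
    for column in expected_columns:
        if null_rate(rows, column) > max_null_rate:
            violations.append(f"null rate spike: {column}")

    # Uniqueness breaks: duplicate values in the key column.
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        violations.append(f"uniqueness break: {key}")

    return violations


last_update = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)
print(freshness_lag(last_update, now=checked_at))  # 1:30:00

sample_rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]
print(contract_violations(sample_rows, ["order_id", "amount"],
                          key="order_id", max_null_rate=0.1))
# ['null rate spike: amount', 'uniqueness break: order_id']
```

In practice these checks would read from table metadata or a sampled query rather than in-memory rows; the point is that each SLI reduces to a number or a list that can be compared against a threshold.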
Reliability metrics should be tied to on-call ownership. If nobody is paged, the metric is decorative.
Who owns what when it breaks
Reliability is organizational. You need explicit ownership boundaries:
- platform team: catalog reliability, cross-engine access patterns, shared governance mechanisms
- data product teams: freshness and correctness for owned datasets, plus incident response runbooks
- security and compliance: policy enforcement requirements, audit log retention, and controls testing
If every incident becomes a cross-team argument about responsibility, your platform is not reliable. It is only loud.
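Ownership boundaries are easier to enforce when they exist as a lookup rather than as tribal knowledge. The sketch below is one hypothetical way to encode the boundaries listed above; the category keys and the `route_incident` helper are assumptions for illustration, not part of any particular paging tool.

```python
# Hypothetical ownership map derived from the boundaries listed above.
OWNERS = {
    "catalog": "platform team",
    "cross-engine access": "platform team",
    "freshness": "data product team",
    "correctness": "data product team",
    "policy enforcement": "security and compliance",
    "audit log retention": "security and compliance",
}


def route_incident(category):
    """Return the owning team, or fail loudly if ownership is undefined."""
    try:
        return OWNERS[category]
    except KeyError:
        raise LookupError(
            f"no owner defined for {category!r}; fix the ownership map"
        ) from None


print(route_incident("freshness"))  # data product team
```

Failing loudly on an unmapped category is deliberate: a gap in the map is found during routing setup instead of during an incident.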
Reliability anti-patterns
- warehouse-only SLAs: promising reliability based on one engine while other access paths remain unmanaged
- no rollback plan: treating table writes as irreversible and hoping nothing goes wrong
- lineage as a dashboard: capturing lineage after the fact instead of as part of execution
- best-effort governance: relying on a process document instead of enforcement in the data path
If you recognize those anti-patterns, start with Running the Open Lakehouse in Production.
A simple reliability playbook
A pragmatic reliability playbook for an open data platform:
- Define SLIs for freshness, correctness, access, and recovery.
- Set SLOs (service level objectives) and alert when an objective is at risk, not after it has been missed.
- Attach every SLO to an owner and an incident runbook.
- Standardize lineage and audit events so incidents are explainable.
- Run regular disaster recovery drills that include the catalog boundary.
The goal is not perfection. The goal is predictable failure and fast recovery.
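The steps about alerting before a miss and attaching an owner and runbook to every SLO can be sketched as a small data structure plus an evaluation function. This is a minimal sketch under stated assumptions: the `SLO` fields, the `warn_ratio` early-warning threshold, the `evaluate` function, and the runbook path are hypothetical, and the SLI is assumed to be a "lower is better" number such as minutes of freshness lag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    objective: float   # maximum allowed SLI value, e.g. minutes of lag
    warn_ratio: float  # fraction of the objective that triggers an early alert
    owner: str         # team that gets paged
    runbook: str       # hypothetical runbook location


def evaluate(slo, sli_value):
    """Classify an SLI reading as 'ok', 'at_risk', or 'breached'."""
    if sli_value > slo.objective:
        return "breached"
    if sli_value >= slo.objective * slo.warn_ratio:
        return "at_risk"
    return "ok"


orders_freshness = SLO(
    name="orders freshness",
    objective=60,       # minutes
    warn_ratio=0.8,     # alert once lag reaches 48 minutes
    owner="orders data product team",
    runbook="runbooks/orders-freshness.md",
)

for lag_minutes in (30, 50, 75):
    print(lag_minutes, evaluate(orders_freshness, lag_minutes))
# 30 ok
# 50 at_risk
# 75 breached
```

The "at_risk" state is where paging is most useful: the owner named on the SLO still has time to follow the runbook before the objective is missed.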
Sources to start with
Use standards and specifications to define the contracts, then measure reliability against those contracts.