Open Data Infrastructure
Data Contracts for Streaming Lakehouse Pipelines
How streaming lakehouse data contracts should cover schema, event time, deduplication, ordering, late data, policy, and replay.
Batch contracts usually argue about columns. Streaming contracts also have to argue about time.
Streaming contracts are different
Apache Iceberg documents Flink support for streaming writes, and Flink documents checkpoints and savepoints for recovering stateful streaming jobs. DataHub and dbt document contract concepts around assertions, schema, and producer commitments.
Streaming lakehouse pipelines need contracts that cover more than schema. The contract also has to state event time, processing time, watermark assumptions, deduplication, ordering, late-data handling, replay behavior, policy status, and freshness expectations.
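To make the time-related terms concrete, here is a minimal Python sketch of how a watermark assumption and an allowed-lateness term might classify incoming events. It is a simplified model for illustration only, not Flink's watermark API. The names `WatermarkTracker`, `max_out_of_orderness`, and `classify` are hypothetical, and the three verdict labels are assumptions about how a contract could name the outcomes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class WatermarkTracker:
    """Simplified bounded-out-of-orderness watermark (illustrative only)."""
    max_out_of_orderness: timedelta   # how far event time may trail the newest event
    allowed_lateness: timedelta       # extra grace period behind the watermark
    max_event_time: Optional[datetime] = None

    @property
    def watermark(self) -> Optional[datetime]:
        # The watermark trails the largest event time seen so far.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness

    def classify(self, event_time: datetime) -> str:
        """Classify against the watermark as it stood before this event, then advance it."""
        wm = self.watermark
        if wm is None or event_time >= wm:
            verdict = "on_time"
        elif event_time >= wm - self.allowed_lateness:
            verdict = "late_accepted"
        else:
            verdict = "late_dropped"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return verdict


t0 = datetime(2024, 1, 1, 12, 0, 0)
tracker = WatermarkTracker(timedelta(seconds=5), timedelta(seconds=30))
verdicts = [tracker.classify(t0 + timedelta(seconds=s)) for s in (0, 60, 50, 10)]
# verdicts == ["on_time", "on_time", "late_accepted", "late_dropped"]
```

The point of the sketch is that the same event can be on time, accepted late, or dropped depending on two numbers that consumers never see unless the contract states them.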
What the contract must say
A useful contract names the event identity, event-time field, allowed lateness, duplicate handling, schema compatibility, partitioning assumptions, policy rules, replay window, owner, and consumer impact of changes.
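The terms listed above could be captured in a machine-readable record. The following is a hedged sketch, not a format defined by DataHub, dbt, or any other tool: the class `StreamingContract`, its field names, the example values, and the helper `missing_terms` are all hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class StreamingContract:
    """Hypothetical contract record; one field per term named in the text."""
    event_id_fields: tuple                  # columns that identify one event
    event_time_field: Optional[str]         # column carrying event time
    allowed_lateness: Optional[timedelta]   # zero is a valid, explicit choice
    duplicate_handling: Optional[str]       # e.g. "dedupe_by_event_id"
    schema_compatibility: Optional[str]     # e.g. "backward"
    partitioning: Optional[str]             # partitioning assumption
    policy_rules: tuple                     # policy rule identifiers
    replay_window: Optional[timedelta]      # how far back a replay may reach
    owner: Optional[str]
    change_impact: Optional[str]            # what consumers should expect on change


def missing_terms(contract: StreamingContract) -> list:
    """Return the names of contract terms that were left unset or empty."""
    missing = []
    for f in fields(contract):
        value = getattr(contract, f.name)
        if value is None or value == "" or value == ():
            missing.append(f.name)
    return missing


draft = StreamingContract(
    event_id_fields=("order_id",),
    event_time_field=None,                  # left as tribal knowledge
    allowed_lateness=timedelta(0),
    duplicate_handling="dedupe_by_event_id",
    schema_compatibility="backward",
    partitioning="day(event_time)",
    policy_rules=("pii_masked",),
    replay_window=timedelta(days=7),
    owner="",
    change_impact="notify consumers before breaking changes",
)
# missing_terms(draft) == ["event_time_field", "owner"]
```

Checking for `None` and empty values, rather than general falsiness, keeps an explicit zero lateness from being reported as missing.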
The contract should also state what happens during recovery. If a savepoint restore or replay can change downstream table behavior, consumers need to know which guarantees still hold.
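One recovery guarantee a contract can state is that replay is idempotent: applying the same events again leaves the table unchanged. The sketch below illustrates that idea with an in-memory dictionary standing in for a table. It assumes event identifiers are stable across replays and that duplicates carry identical payloads; the function `apply_events` and the field name `event_id` are hypothetical.

```python
def apply_events(table: dict, events: list, id_field: str = "event_id") -> dict:
    """Insert each event once, keyed by its identity; later duplicates are ignored."""
    for event in events:
        # setdefault keeps the first row seen for a key, so replays are no-ops.
        table.setdefault(event[id_field], event)
    return table


events = [
    {"event_id": "a", "amount": 10},
    {"event_id": "b", "amount": 20},
    {"event_id": "a", "amount": 10},   # duplicate delivery of event "a"
]

first_run = apply_events({}, events)
after_replay = apply_events(dict(first_run), events)   # replay the same input
# first_run == after_replay, and both hold exactly two rows
```

If duplicates could carry different payloads, or if identifiers were regenerated on replay, this guarantee would not hold, which is exactly the kind of condition the contract should spell out.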
Core idea: A streaming contract is a promise about data, time, state, and recovery.
The ODI pipeline contract
Open Data Infrastructure (ODI) strengthens the contract by connecting streaming jobs, open table commits, catalog metadata, lineage, and data product SLAs. The contract should be visible to both human and machine consumers before they depend on the stream.
For adjacent context, read Flink and Iceberg streaming patterns, Flink savepoint governance, and semantic contracts in Open Data Infrastructure.
What breaks first
- The schema is contracted, but event-time and lateness behavior are tribal knowledge.
- A replay fixes one incident but creates duplicate or reordered data downstream.
- Policy status is evaluated at ingest but not preserved in the table contract.
- Consumers read the lakehouse table without knowing the streaming recovery window.
Questions to ask
Ask what the contract says about time, duplicates, ordering, replay, recovery, freshness, and policy. Ask whether those guarantees are tested and visible in catalog and lineage systems.
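Two of these questions, freshness and duplicates, are straightforward to turn into automated checks. The following Python sketch shows one possible shape for such checks. The function names, the `event_id` field, and the idea of passing in the latest committed event time are assumptions made for illustration, not part of any catalog or lineage tool's API.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_breach(latest_event_time: datetime, now: datetime,
                     max_lag: timedelta) -> Optional[timedelta]:
    """Return how far the lag exceeds the agreed limit, or None if within it."""
    lag = now - latest_event_time
    return lag - max_lag if lag > max_lag else None


def duplicate_ids(rows: list, id_field: str = "event_id") -> list:
    """Return the sorted identities that appear more than once in the rows."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
latest = now - timedelta(minutes=10)

within_sla = freshness_breach(latest, now, max_lag=timedelta(minutes=15))
# within_sla is None
over_sla = freshness_breach(latest, now, max_lag=timedelta(minutes=5))
# over_sla == timedelta(minutes=5)

dupes = duplicate_ids([{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}])
# dupes == ["a"]
```

Checks like these only count as contract tests if their thresholds come from the contract itself rather than from values hard-coded in a job.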
Streaming data products need contracts that can survive motion.
Sources to start with
These primary sources anchor the technical claims in this guide.
- Apache Iceberg Flink writes documentation
- Apache Flink savepoints documentation
- Apache Flink checkpoints and savepoints documentation
- DataHub data contracts documentation
- dbt model contracts documentation
- OpenLineage object model documentation
A streaming contract is where time becomes part of the data model.