Data Contracts for Streaming Lakehouse Pipelines

Batch contracts usually argue about columns. Streaming contracts also have to argue about time.

Streaming contracts are different

Apache Iceberg documents Flink support for streaming writes, and Flink documents checkpoints and savepoints for stateful streaming operations. DataHub and dbt document contract concepts around assertions, schema, and producer commitments.

Streaming lakehouse pipelines need contracts that cover more than schema. They need event time, processing time, watermark assumptions, deduplication, ordering, late data, replay behavior, policy status, and freshness expectations.
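To make the time vocabulary concrete, here is a minimal Python sketch of how a watermark and an allowed-lateness bound could separate on-time, late-but-accepted, and dropped events. It is a simplified model, not Flink's or Iceberg's API: the function names, the "newest event time minus a fixed bound" watermark rule, and the example durations are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def watermark(max_event_time: datetime, max_out_of_orderness: timedelta) -> datetime:
    """Assumed rule: newest event time seen so far, minus a fixed out-of-orderness bound."""
    return max_event_time - max_out_of_orderness


def classify(event_time: datetime, current_watermark: datetime,
             allowed_lateness: timedelta) -> str:
    """Classify one event against the watermark (hypothetical labels)."""
    if event_time > current_watermark:
        return "on_time"
    if event_time >= current_watermark - allowed_lateness:
        return "late_accepted"
    return "late_dropped"


# Example values are illustrative only.
newest_seen = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
wm = watermark(newest_seen, timedelta(seconds=5))
straggler = datetime(2024, 1, 1, 11, 59, 30, tzinfo=timezone.utc)
print(classify(straggler, wm, timedelta(minutes=1)))
```

A contract that states both numbers (the out-of-orderness bound and the allowed lateness) lets a consumer predict which of these three outcomes applies to a delayed event.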

What the contract must say

A useful contract names the event identity, event-time field, allowed lateness, duplicate handling, schema compatibility, partitioning assumptions, policy rules, replay window, owner, and consumer impact of changes.
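One way to keep these items from staying informal is to write them down as a typed record that can be validated and published. The sketch below is a hypothetical shape, not a DataHub or dbt schema: every field name, example value, and validation rule is an assumption chosen to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, fields, replace


@dataclass(frozen=True)
class StreamingContract:
    # All field names are hypothetical; they mirror the items a contract should name.
    dataset: str
    owner: str
    event_id_field: str
    event_time_field: str
    allowed_lateness_seconds: int
    duplicate_handling: str          # e.g. "dedupe_on_event_id"
    schema_compatibility: str        # e.g. "backward"
    partitioning: str
    policy_rules: tuple
    replay_window_hours: int
    consumer_impact_of_changes: str

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means nothing is missing."""
        issues = [f"{f.name} is empty" for f in fields(self)
                  if getattr(self, f.name) in ("", ())]
        if self.allowed_lateness_seconds < 0:
            issues.append("allowed_lateness_seconds must not be negative")
        if self.replay_window_hours <= 0:
            issues.append("replay_window_hours must be positive")
        return issues


# Illustrative values only.
orders_contract = StreamingContract(
    dataset="orders_stream",
    owner="orders-team",
    event_id_field="event_id",
    event_time_field="event_time",
    allowed_lateness_seconds=60,
    duplicate_handling="dedupe_on_event_id",
    schema_compatibility="backward",
    partitioning="day(event_time)",
    policy_rules=("pii_masked",),
    replay_window_hours=72,
    consumer_impact_of_changes="notify consumers before breaking changes",
)

unowned = replace(orders_contract, owner="")
print(orders_contract.problems())   # no problems expected for the complete contract
print(unowned.problems())           # flags the missing owner
print(json.dumps(asdict(orders_contract), indent=2))  # machine-readable form
```

Serializing the record gives catalog and lineage tools something to read, and the `problems` check can run in CI so an incomplete contract is caught before consumers depend on the stream.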

The contract should also state what happens during recovery. If a savepoint restore or replay can change downstream table behavior, consumers need to know which guarantees still hold.

Core idea: A streaming contract is a promise about data, time, state, and recovery.

The ODI pipeline contract

Open Data Infrastructure makes the contract stronger by connecting streaming jobs, open table commits, catalog metadata, lineage, and data product SLAs. The contract should be visible to human and machine consumers before they depend on the stream.

For adjacent context, read Flink and Iceberg streaming patterns, Flink savepoint governance, and semantic contracts in Open Data Infrastructure.

What breaks first

The schema is contracted, but event-time and lateness behavior are tribal knowledge.
Replay fixes one incident and creates duplicate or reordered downstream data.
Policy status is evaluated at ingest but not preserved in the table contract.
Consumers read the lakehouse table without knowing the streaming recovery window.
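The replay failure above is easier to reason about with a small example. The sketch below assumes each event carries a business key and a monotonically increasing sequence number (both hypothetical field names) and applies events as an idempotent upsert, so replaying a batch, even out of order, leaves the table unchanged. It models the idea in plain Python dictionaries and is not how Flink or Iceberg implement upserts.

```python
def apply_events(table: dict, events: list) -> dict:
    """Idempotent upsert: keep one row per key, preferring the highest sequence number."""
    for event in events:
        current = table.get(event["key"])
        # Older or identical events are ignored, so duplicates and reordering are harmless.
        if current is None or event["sequence"] > current["sequence"]:
            table[event["key"]] = event
    return table


# Illustrative batch; "key", "sequence", and "status" are assumed field names.
batch = [
    {"key": "order-1", "sequence": 1, "status": "created"},
    {"key": "order-1", "sequence": 2, "status": "paid"},
    {"key": "order-2", "sequence": 1, "status": "created"},
]

first_run = apply_events({}, batch)
# Simulate a replay that delivers the same events again, in reverse order.
after_replay = apply_events(dict(first_run), list(reversed(batch)))
print(after_replay == first_run)
```

If the contract names the key and the ordering field, consumers can verify this property themselves instead of trusting that a replay was safe.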

Questions to ask

Ask what the contract says about time, duplicates, ordering, replay, recovery, freshness, and policy. Ask whether those guarantees are tested and visible in catalog and lineage systems.
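As a sketch of what "tested" could mean in practice, the function below runs two checks over a sample of rows: duplicate event identities and freshness against a maximum lag. The function name, the default field names, and the thresholds are assumptions for illustration; a real pipeline would read the sample from the table and take the thresholds from the contract.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def run_contract_checks(rows, now, max_lag, id_field="event_id", time_field="event_time"):
    """Return duplicate ids and a freshness flag for a sample of rows (hypothetical helper)."""
    counts = Counter(row[id_field] for row in rows)
    duplicate_ids = sorted(i for i, n in counts.items() if n > 1)
    newest = max((row[time_field] for row in rows), default=None)
    # An empty sample is treated as stale rather than fresh.
    fresh = newest is not None and now - newest <= max_lag
    return {"duplicate_ids": duplicate_ids, "fresh": fresh}


# Illustrative sample with one duplicated event id.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = [
    {"event_id": "a", "event_time": now - timedelta(minutes=3)},
    {"event_id": "b", "event_time": now - timedelta(minutes=1)},
    {"event_id": "a", "event_time": now - timedelta(minutes=3)},
]
print(run_contract_checks(sample, now, max_lag=timedelta(minutes=5)))
```

Publishing the result of checks like these next to the table's catalog entry is one way to make the guarantees visible rather than assumed.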

Streaming data products need contracts that can survive motion.

Sources to start with

These primary sources anchor the technical claims in this guide.

A streaming contract is where time becomes part of the data model.

