A streaming data product is not only a topic, a table, or a dashboard. It is a running system with memory.

Streaming lineage needs runtime state

Apache Flink documents checkpointing, savepoints, and stateful stream processing. Those mechanisms are central to recovery and operational correctness. For streaming data products, they are also lineage evidence.

Batch lineage often focuses on datasets and jobs. Streaming lineage needs more: source offsets, checkpoint IDs, savepoints, sink commits, restart events, watermarks, and recovery decisions.
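
As a rough illustration, the runtime fields listed above could travel together in one record per sink commit. This is a minimal sketch, not a Flink or OpenLineage structure: the class name, every field name, and the sample values are assumptions made up for this example.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RecoveryEvidence:
    """Runtime evidence for one sink commit (hypothetical shape, illustrative names)."""

    job_name: str
    job_version: str
    checkpoint_id: int
    # Source positions covered by the checkpoint, keyed by "topic-partition".
    source_offsets: Dict[str, int]
    sink_commit_id: str
    watermark_ms: int
    # Populated only when the job was restored before this commit.
    restored_from: Optional[str] = None
    restart_reason: Optional[str] = None

    def is_recovery_commit(self) -> bool:
        """True when this commit was produced after a restore."""
        return self.restored_from is not None


evidence = RecoveryEvidence(
    job_name="orders-enrichment",
    job_version="1.4.2",
    checkpoint_id=1287,
    source_offsets={"orders-0": 50_000, "orders-1": 49_870},
    sink_commit_id="commit-0042",
    watermark_ms=1_700_000_000_000,
    restored_from="checkpoint-1286",
    restart_reason="task manager lost",
)
record = asdict(evidence)  # plain dict, ready to serialize next to the dataset edge
```

Keeping the restore fields optional means a normal commit and a post-recovery commit share one schema, and the difference stays queryable.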

Checkpoints are evidence

A checkpoint should help answer which input range produced which output state, which job version ran, and what happened during recovery. A savepoint should help explain intentional operational changes, not only support migration or upgrade work.

That evidence matters when a data product promises freshness, exactly-once behavior, replayability, or incident recovery. Without it, the pipeline may recover while the business loses the ability to explain what happened.

Core idea: streaming data product lineage has to include runtime recovery evidence, not only dataset edges.

Data products need recovery context

Open Data Infrastructure should connect Flink checkpoint evidence to open table commits, catalog metadata, OpenLineage events, and data product SLAs. The lineage record should show not only that a job wrote a table, but which recovery path got it there.
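
One way to carry that recovery path is a custom run facet on an OpenLineage run event. The sketch below builds the event as a plain dictionary rather than through a client library. The facet key `example_flinkRecovery`, its fields, the producer URL, the facet schema URL, and the namespaces are all hypothetical, and the `schemaURL` version string is an assumption that should be checked against the OpenLineage spec version in use.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional

# Hypothetical identifiers, not published endpoints.
PRODUCER = "https://example.com/streaming-lineage-sketch"
FACET_SCHEMA = "https://example.com/schemas/FlinkRecoveryRunFacet.json"


def build_run_event(job_name: str, table: str, checkpoint_id: int,
                    commit_id: str, restored_from: Optional[str] = None) -> dict:
    """Build an OpenLineage-style run event that carries recovery evidence."""
    recovery_facet = {
        "_producer": PRODUCER,
        "_schemaURL": FACET_SCHEMA,
        "checkpointId": checkpoint_id,
        "sinkCommitId": commit_id,
        "restoredFrom": restored_from,  # None when no restore happened
    }
    return {
        "eventType": "RUNNING",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Assumed spec version; verify before relying on it.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {"example_flinkRecovery": recovery_facet},
        },
        "job": {"namespace": "example-streaming", "name": job_name},
        "inputs": [],
        "outputs": [{"namespace": "example-lakehouse", "name": table}],
    }


event = build_run_event("orders-enrichment", "sales.orders_enriched",
                        checkpoint_id=1287, commit_id="commit-0042",
                        restored_from="checkpoint-1286")
payload = json.dumps(event)  # serialized form, ready to send to a lineage backend
```

With the facet attached to the run, a lineage backend can show the table edge and the restore that preceded the commit in the same record.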

For adjacent context, read Flink savepoint governance, Flink exactly-once reality, and streaming lakehouse data contracts.

What breaks first

  • Lineage records the pipeline name but not the checkpoint or recovery event.
  • Savepoints are treated as operator artifacts with no data product owner visibility.
  • A sink commit succeeds after recovery, but the incident record cannot show what was replayed.
  • Freshness metrics ignore the difference between normal delay and recovery delay.
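
The last failure mode can be made concrete with a small sketch that separates freshness samples taken during a recovery window from those taken in steady state. It assumes recovery windows are already known as (start, end) timestamp pairs, for example derived from logged restart events. The type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FreshnessSample:
    observed_at_ms: int  # when the lag was measured
    lag_ms: int          # delay between event time and sink visibility


def split_by_recovery(
    samples: List[FreshnessSample],
    recovery_windows: List[Tuple[int, int]],
) -> Tuple[List[FreshnessSample], List[FreshnessSample]]:
    """Return (normal, recovering) samples; window bounds are inclusive."""
    normal: List[FreshnessSample] = []
    recovering: List[FreshnessSample] = []
    for sample in samples:
        in_recovery = any(
            start <= sample.observed_at_ms <= end
            for start, end in recovery_windows
        )
        (recovering if in_recovery else normal).append(sample)
    return normal, recovering


samples = [
    FreshnessSample(observed_at_ms=100, lag_ms=900),
    FreshnessSample(observed_at_ms=250, lag_ms=45_000),
    FreshnessSample(observed_at_ms=400, lag_ms=1_100),
]
normal, recovering = split_by_recovery(samples, recovery_windows=[(200, 300)])
```

Reporting the two groups separately keeps a single restart from distorting the steady-state freshness figure, while still making the recovery delay visible against the SLA.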

Questions to ask

Ask which checkpoint state is retained, how recovery events are logged, and how source offsets map to sink commits. Ask whether the data product owner can explain a restart without reading raw job manager logs.
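
The offsets-to-commits question can be checked mechanically once each sink commit records the source range it covers. The following is a minimal sketch for a single partition, assuming commits arrive in commit order and carry an inclusive start and exclusive end offset. The `CommitRange` type and `audit_commits` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CommitRange:
    commit_id: str
    checkpoint_id: int
    start_offset: int  # inclusive
    end_offset: int    # exclusive


Finding = Tuple[str, int, int]  # (commit_id, range_start, range_end)


def audit_commits(commits: List[CommitRange]) -> Tuple[List[Finding], List[Finding]]:
    """Return (replayed, gaps) for one partition, given commits in commit order."""
    replayed: List[Finding] = []
    gaps: List[Finding] = []
    high_water: Optional[int] = None  # highest offset covered so far
    for commit in commits:
        if high_water is not None:
            if commit.start_offset < high_water:
                # Input already covered by an earlier commit was read again.
                overlap_end = min(commit.end_offset, high_water)
                replayed.append((commit.commit_id, commit.start_offset, overlap_end))
            elif commit.start_offset > high_water:
                # No commit accounts for this stretch of input.
                gaps.append((commit.commit_id, high_water, commit.start_offset))
        if high_water is None:
            high_water = commit.end_offset
        else:
            high_water = max(high_water, commit.end_offset)
    return replayed, gaps


history = [
    CommitRange("c1", 10, 0, 100),
    CommitRange("c2", 11, 100, 200),
    CommitRange("c3", 12, 150, 260),  # written after a restore
    CommitRange("c4", 13, 300, 350),
]
replayed, gaps = audit_commits(history)
```

In this sample history, the audit reports offsets 150 to 200 as replayed by `c3` and offsets 260 to 300 as unaccounted for before `c4`, which is the kind of answer an incident record needs without a trip through job manager logs.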

A stream is trustworthy when its recovery story is part of its lineage story.