Open Data Infrastructure
Flink Watermarks as Data Product Freshness Evidence
How Flink watermarks, event time, late data, and sink commits should feed freshness evidence for data products and agents.
A streaming table can be updated every minute and still be stale for the question that matters. Freshness depends on event time, not just when a job last ran.
Freshness is not wall-clock time
Apache Flink distinguishes event time from processing time and uses watermarks to track progress in event time, which is the time embedded in the data rather than the clock on the machine processing it. That distinction matters for data products because a pipeline that runs on schedule can still be missing late events.
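To make the event-time idea concrete, here is a minimal Python simulation of a bounded-out-of-orderness watermark. It is a sketch, not Flink code: the class name and fields are invented for illustration, and it recomputes the watermark on every event, whereas a Flink job emits watermarks periodically. The formula (largest timestamp seen, minus the allowed out-of-orderness, minus one millisecond) follows Flink's built-in bounded-out-of-orderness generator.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoundedOutOfOrdernessTracker:
    """Illustrative stand-in for a bounded-out-of-orderness watermark generator."""

    max_out_of_orderness_ms: int
    max_event_ts_ms: Optional[int] = None
    late_events: List[int] = field(default_factory=list)

    def current_watermark(self) -> Optional[int]:
        # No watermark exists until at least one event has been seen.
        if self.max_event_ts_ms is None:
            return None
        return self.max_event_ts_ms - self.max_out_of_orderness_ms - 1

    def observe(self, event_ts_ms: int) -> bool:
        """Record one event; return True if it arrived at or behind the watermark."""
        watermark = self.current_watermark()
        is_late = watermark is not None and event_ts_ms <= watermark
        if is_late:
            self.late_events.append(event_ts_ms)
        if self.max_event_ts_ms is None or event_ts_ms > self.max_event_ts_ms:
            self.max_event_ts_ms = event_ts_ms
        return is_late


# Events arrive out of order; processing order is the list order.
tracker = BoundedOutOfOrdernessTracker(max_out_of_orderness_ms=5_000)
results = [tracker.observe(ts) for ts in (10_000, 20_000, 16_000, 12_000)]
```

The event stamped 16,000 is out of order but still inside the five-second bound, so it counts. The event stamped 12,000 falls behind the watermark and is late, even though the job processed it without any error.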
If a fraud feature, logistics dashboard, or agent workflow depends on a streaming lakehouse table, it needs to know more than the time of the last successful write. It needs to know how far event time has advanced, how late data is handled, and which sink commit made that progress visible to readers.
Watermarks are evidence
A watermark is not a business SLA by itself. It is infrastructure evidence that helps define one. Teams can translate watermark lag, late-event policy, checkpoint behavior, and table commit state into freshness promises that consumers can understand.
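One way to picture that translation is a small evidence record checked against the limits a contract promises. The sketch below is hypothetical throughout: the field names, the thresholds, and the `broken_promises` helper are assumptions, and it assumes event timestamps and wall-clock readings share the same epoch-millisecond clock.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FreshnessEvidence:
    """Hypothetical record of what the platform can observe about a table."""

    watermark_ms: int         # event-time progress reported by the job
    last_commit_ms: int       # wall-clock time of the last visible sink commit
    observed_at_ms: int       # wall-clock time when this evidence was read
    late_events_dropped: int  # late events discarded since the previous commit

    @property
    def watermark_lag_ms(self) -> int:
        return max(0, self.observed_at_ms - self.watermark_ms)

    @property
    def commit_age_ms(self) -> int:
        return max(0, self.observed_at_ms - self.last_commit_ms)


def broken_promises(
    evidence: FreshnessEvidence,
    max_watermark_lag_ms: int,
    max_commit_age_ms: int,
    max_late_dropped: int,
) -> List[str]:
    """Return consumer-readable reasons the promise is broken; empty means it holds."""
    reasons = []
    if evidence.watermark_lag_ms > max_watermark_lag_ms:
        reasons.append(
            f"event time is {evidence.watermark_lag_ms} ms behind; "
            f"limit is {max_watermark_lag_ms} ms"
        )
    if evidence.commit_age_ms > max_commit_age_ms:
        reasons.append(
            f"last commit is {evidence.commit_age_ms} ms old; "
            f"limit is {max_commit_age_ms} ms"
        )
    if evidence.late_events_dropped > max_late_dropped:
        reasons.append(
            f"{evidence.late_events_dropped} late events dropped; "
            f"limit is {max_late_dropped}"
        )
    return reasons


# The table was committed 30 seconds ago, yet event time is 10 minutes behind.
evidence = FreshnessEvidence(
    watermark_ms=1_700_000_000_000,
    last_commit_ms=1_700_000_570_000,
    observed_at_ms=1_700_000_600_000,
    late_events_dropped=0,
)
reasons = broken_promises(evidence, 300_000, 120_000, 0)
```

In this example the commit check passes while the watermark check fails, which is the case a "last successful write" timestamp alone would miss.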
That translation is the hard part. A dashboard may tolerate five minutes of late data. An agent making an operational decision may need a stricter rule, such as refusing to act when freshness falls behind.
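The difference between those two consumers can be sketched as per-consumer policies over the same watermark lag. Everything here is illustrative: the policy names, the one-minute agent limit, and the three-way decision are assumptions, and only the five-minute dashboard tolerance comes from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SERVE = "serve"
    WARN = "serve_with_warning"
    DENY = "deny"


@dataclass(frozen=True)
class ConsumerPolicy:
    """Hypothetical freshness policy attached to one consumer of a table."""

    name: str
    max_lag_ms: int
    on_breach: Decision  # what to do once the lag exceeds max_lag_ms


def decide(policy: ConsumerPolicy, watermark_lag_ms: int) -> Decision:
    if watermark_lag_ms <= policy.max_lag_ms:
        return Decision.SERVE
    return policy.on_breach


# The dashboard degrades gracefully; the agent is refused outright.
DASHBOARD = ConsumerPolicy("dashboard", 5 * 60 * 1000, Decision.WARN)
OPS_AGENT = ConsumerPolicy("ops_agent", 60 * 1000, Decision.DENY)  # assumed limit

lag_ms = 3 * 60 * 1000  # event time is three minutes behind
dashboard_decision = decide(DASHBOARD, lag_ms)
agent_decision = decide(OPS_AGENT, lag_ms)
```

With a three-minute lag the dashboard is still served while the agent is denied, so the same evidence yields different answers depending on who is asking.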
Core idea: Flink watermarks should feed data product freshness evidence, not stay hidden inside streaming job internals.
The ODI freshness path
Open Data Infrastructure should connect Flink event-time progress to Iceberg table commits, catalog metadata, lineage, and data product contracts. The consumer should see whether the table is fresh enough for the use case, not just whether the pipeline is green.
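As a rough sketch of that path, suppose the streaming job records its watermark in the string-to-string summary map that each Iceberg snapshot carries. The property keys below are hypothetical, not names defined by Iceberg or Flink, and `freshness_status` is an invented helper that a catalog integration might expose to consumers.

```python
from typing import Dict

# Hypothetical snapshot summary keys; a real deployment would choose its own.
WATERMARK_KEY = "freshness.watermark-ms"
CHECKPOINT_KEY = "freshness.checkpoint-id"


def freshness_status(
    table: str,
    snapshot_id: int,
    committed_at_ms: int,
    summary: Dict[str, str],
    max_lag_ms: int,
) -> Dict[str, object]:
    """Turn one snapshot's metadata into a status a consumer or agent can read."""
    base = {"table": table, "snapshot_id": snapshot_id}
    raw = summary.get(WATERMARK_KEY)
    if raw is None:
        # A successful commit without a watermark says nothing about event time.
        return {**base, "status": "unknown", "reason": "commit carries no watermark"}
    watermark_ms = int(raw)
    lag_ms = max(0, committed_at_ms - watermark_ms)
    return {
        **base,
        "status": "fresh" if lag_ms <= max_lag_ms else "stale",
        "watermark_ms": watermark_ms,
        "lag_at_commit_ms": lag_ms,
        "checkpoint_id": summary.get(CHECKPOINT_KEY),
    }


stale = freshness_status(
    "orders", 42, 1_700_000_600_000,
    {WATERMARK_KEY: "1700000000000", CHECKPOINT_KEY: "118"},
    max_lag_ms=300_000,
)
unknown = freshness_status("orders", 43, 1_700_000_660_000, {}, max_lag_ms=300_000)
```

Returning "unknown" rather than "fresh" for a commit with no watermark keeps a green pipeline from being read as a freshness guarantee.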
For adjacent context, read Flink and Iceberg streaming patterns, Flink savepoint governance, and streaming lakehouse data contracts.
What breaks first
- Pipeline health uses processing time while consumers reason in event time.
- Late data policies live in job code but not in the data product contract.
- Watermark lag is monitored by platform teams but hidden from agent tools.
- Table commits succeed while freshness evidence remains ambiguous.
Questions to ask
Ask which event-time field defines freshness, how watermarks are generated, and what happens to late data. Ask whether the catalog exposes freshness status in language consumers and agents can use.
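These questions can be turned into a checklist over the contract itself. The sketch below assumes a hypothetical `FreshnessContract` shape and a made-up set of late-data policy names; the point is only that an unanswered question becomes a visible gap rather than an implicit default in job code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative policy names, not a standard vocabulary.
ALLOWED_LATE_POLICIES = {"drop", "side_output", "update_in_place"}


@dataclass(frozen=True)
class FreshnessContract:
    """Hypothetical freshness section of a data product contract."""

    event_time_field: Optional[str]
    watermark_strategy: Optional[str]
    late_data_policy: Optional[str]
    consumer_status_text: Optional[str]


def unanswered_questions(contract: FreshnessContract) -> List[str]:
    """List the freshness questions the contract leaves open."""
    gaps = []
    if not contract.event_time_field:
        gaps.append("Which event-time field defines freshness?")
    if not contract.watermark_strategy:
        gaps.append("How are watermarks generated?")
    if contract.late_data_policy not in ALLOWED_LATE_POLICIES:
        gaps.append("What happens to late data?")
    if not contract.consumer_status_text:
        gaps.append("How is freshness status described to consumers and agents?")
    return gaps


draft = FreshnessContract(
    event_time_field="order_created_at",
    watermark_strategy="bounded out-of-orderness, 5 s",
    late_data_policy=None,  # still lives only in job code
    consumer_status_text="Orders are complete up to the stated watermark.",
)
gaps = unanswered_questions(draft)
```

Here the draft answers three of the four questions, and the late-data gap is reported explicitly.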
Fresh data is not data that arrived. Fresh data is data whose time contract still holds.
Sources to start with
These primary sources anchor the technical claims in this guide.
- Apache Flink timely stream processing documentation
- Apache Flink generating watermarks documentation
- Apache Iceberg Flink writes documentation
- OpenLineage object model documentation
Watermarks are where streaming freshness becomes inspectable.