Open Data Infrastructure
Apache Flink to Iceberg: Streaming Patterns for ODI
Streaming into open tables is where "portable" becomes real. It forces you to choose semantics: append-only, upserts, deletes, compaction, and recovery.
Batch workloads forgive sloppy contracts. Streaming workloads punish them immediately.
Why streaming is the ODI stress test
Streaming ingestion is where an open lakehouse either becomes a durable platform or becomes a pile of files with optimism. Continuous writes amplify every weakness: small files, partial failures, schema drift, and the cost of enforcing deletes and updates at read time.
Core idea: if your streaming writes are not portable, your lakehouse is not open. It is only open at rest.
If you want a layer model before you jump into patterns, start with Table Format vs Catalog vs Query Engine.
Four patterns that cover most teams
Most teams end up in one of these patterns when writing from Flink into Iceberg:
- Append-only event tables: immutable events, partitioned by event time, with downstream derived tables handling aggregation and correction.
- Upsert tables (CDC semantics): a primary key and change events applied as updates and deletes, often driven by a log-based CDC tool upstream.
- Micro-batch streaming: streaming sources, but writes grouped into larger commits to reduce small file and metadata churn.
- Dual-table pattern: append-only raw events plus a curated upserted table for consumers who need current state.
The right pattern is not ideological. It depends on how often data changes, what consumers need, and how much operational complexity you are willing to own.
Commit and consistency realities
Streaming writers need clear consistency semantics. Iceberg tables provide snapshot-based isolation, but your pipeline still needs to decide how and when to commit.
Three practical questions:
- What is the atomic unit of correctness? one event, one partition, one micro-batch, or one time window?
- What is your retry behavior? how do you avoid duplicate writes and partial state when jobs restart?
- How do you coordinate with consumers? are consumers reading incremental changes, or only stable snapshots?
These are not only Flink questions. They are table contract questions. See Time Travel in Data Systems for the snapshot model that underlies the behavior.
Maintenance and compaction strategy
Streaming writes create small files. The solution is not wishful thinking. The solution is a maintenance plan.
Plan for:
- compaction cadence that keeps file sizes healthy
- retention policy that keeps snapshot history useful, not expensive
- cleanup procedures that handle orphan files safely
If you do not automate maintenance, you are outsourcing it to future outages. Start with Automating Table Maintenance and Compaction.
Governance and audit requirements
Streaming pipelines are also security and governance pipelines. The first time a job writes the wrong data, you will need to answer: who wrote it, why, and how it propagated.
In ODI terms, capture lineage and audit events as part of execution. OpenLineage is a useful standardization starting point. OpenLineage documentation.
A practical checklist
- Choose an explicit write pattern (append-only, upsert, micro-batch, or dual-table).
- Define the atomic unit of correctness and commit cadence.
- Test restart behavior and verify no duplicate or partial writes.
- Automate compaction and retention with metrics and alerts.
- Capture lineage and audit signals at write time.
A streaming lakehouse is production-ready when restarts feel routine, not terrifying.
Sources to start with
Start with the connector docs and the table format spec.