Open Data Infrastructure
Apache Flink CDC Into Iceberg and Paimon
Flink CDC landing patterns should be chosen by table contract, latency, update semantics, and governance, not by logo preference.
CDC into the lakehouse is where "open table format" meets the messiest part of real systems: updates, deletes, ordering, retries, and people expecting yesterday's dashboard to still reconcile.
The practical problem
Change data capture turns source-system changes into downstream data movement. Apache Flink is a common engine for that kind of streaming work. Iceberg and Paimon can both appear in the lakehouse landing zone, but they optimize for different operating assumptions.
The open data infrastructure (ODI) question is not "which format wins?" It is which table contract fits the workload and which governance signals survive the stream.
Core idea: CDC landing design should make update semantics, latency, recovery, and downstream table ownership explicit.
The ODI boundary
CDC crosses source systems, stream processing, storage, table formats, catalogs, and downstream engines. A failure at any of these boundaries can surface as duplicate rows, missing deletes, stale reads, broken schema evolution, or a table nobody trusts.
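To make the duplicate-row and ordering risk concrete, here is a minimal sketch in plain Python rather than Flink code. It assumes every change event carries a per-key sequence number that only increases at the source (for example, a log position). The tuple layout and the name `merge_latest` are hypothetical, chosen only for illustration.

```python
def merge_latest(events):
    """Collapse (key, seq, op, row) events into final table state.

    Assumption: seq is strictly increasing per key at the source, so the
    event with the highest seq is the truth for that key, no matter how
    many times it was redelivered or in which order it arrived.
    """
    latest = {}
    for key, seq, op, row in events:
        prev = latest.get(key)
        if prev is None or seq > prev[0]:
            latest[key] = (seq, op, row)
    # A key whose latest event is a delete must not appear in the result.
    return {key: row for key, (seq, op, row) in latest.items() if op != "d"}


clean = [
    (1, 10, "c", {"status": "new"}),
    (1, 11, "u", {"status": "paid"}),
    (2, 12, "c", {"status": "new"}),
    (2, 13, "d", None),
]
# The same changes after a retry (duplicates) and a reordering.
messy = [clean[3], clean[1], clean[0], clean[1], clean[2], clean[0]]

state_clean = merge_latest(clean)
state_messy = merge_latest(messy)
```

Under that assumption both inputs converge to the same state, which is the property a landing table needs before anyone can trust a reconciliation. If the source cannot supply such a sequence, this approach does not apply and ordering has to be guaranteed upstream.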
Iceberg brings a widely adopted open table format with snapshots, schema evolution, deletes, and multi-engine access. Paimon focuses on streaming lakehouse patterns with Flink and supports CDC ingestion workflows. Both can be part of ODI if the surrounding catalog, policy, lineage, and operational model are designed deliberately.
Patterns that work
Use Iceberg when the downstream contract is multi-engine lakehouse interoperability and analytical consistency. Pay close attention to delete semantics, file layout, snapshot maintenance, and compaction. Streaming writes commit often, and the small files and snapshots they leave behind become operational debt if maintenance is an afterthought.
Use Paimon when the workload needs a table design closer to streaming updates and Flink-first ingestion patterns. Its CDC ingestion documentation is explicit about synchronizing source changes and schema evolution patterns, which are the exact details teams need to evaluate.
Keep raw CDC, curated tables, and serving views separate. Raw change logs are evidence. Curated tables are contracts. Serving views are consumption products. Blurring those layers creates debugging pain with a streaming badge attached.
Failure modes
The first failure is ignoring delete behavior. Inserts are easy to demo. Updates and deletes decide whether the table represents reality.
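A small sketch shows how quietly this goes wrong. It uses plain Python and a hypothetical `ChangeEvent` shape, with single-letter op codes ("c", "u", "d") picked for illustration rather than taken from any specific connector. The same change stream is applied twice, once honoring deletes and once treating the table as upsert-only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str               # "c" = create, "u" = update, "d" = delete (illustrative codes)
    key: int
    row: Optional[dict]   # None for deletes


def apply_changes(events, honor_deletes=True):
    """Replay change events in order into a keyed table (a dict)."""
    table = {}
    for ev in events:
        if ev.op in ("c", "u"):
            table[ev.key] = ev.row
        elif ev.op == "d":
            if honor_deletes:
                table.pop(ev.key, None)
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
    return table


events = [
    ChangeEvent("c", 1, {"status": "new"}),
    ChangeEvent("c", 2, {"status": "new"}),
    ChangeEvent("u", 1, {"status": "paid"}),
    ChangeEvent("d", 2, None),
]

with_deletes = apply_changes(events)
upsert_only = apply_changes(events, honor_deletes=False)
ghost_keys = sorted(upsert_only.keys() - with_deletes.keys())
```

The upsert-only table raises no error. It simply keeps a row the source no longer has, and that kind of discrepancy usually appears later as a count that will not reconcile.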
The second failure is schema drift without governance. Source systems change, Flink jobs adapt or fail, and downstream users discover the change through a broken metric.
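One way to surface drift before a metric breaks is to diff the observed source schema against the schema the curated table has agreed to. The sketch below is plain Python with hypothetical names (`diff_schema`, `is_breaking`) and made-up column types. Treating added columns as non-breaking is a policy assumption, not a rule from either table format.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report the drift."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        "added": sorted(observed_cols - expected_cols),
        "removed": sorted(expected_cols - observed_cols),
        "retyped": sorted(
            col for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


def is_breaking(diff):
    """Assumed policy: dropped or retyped columns need approval; additions do not."""
    return bool(diff["removed"] or diff["retyped"])


contract = {"id": "bigint", "amount": "decimal(10,2)", "status": "string"}
observed = {"id": "bigint", "amount": "double", "status": "string", "coupon": "string"}

drift = diff_schema(contract, observed)
needs_approval = is_breaking(drift)
```

A check like this could run wherever schema changes are observed, so the approval step happens before the change propagates instead of after a dashboard fails.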
The third failure is maintenance debt. Small files, accumulating equality deletes, deferred compaction, untuned checkpointing, and metadata growth can turn a streaming success into an operational bill.
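The small-file part of that bill is mostly arithmetic. The sketch below is a back-of-the-envelope planner in plain Python, not the compaction logic of Iceberg or Paimon, which ship their own maintenance actions. The function name, thresholds, and file sizes are all hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=64, small_threshold_mb=32):
    """Greedily group small files into rewrite batches of at most target_mb."""
    small = sorted(
        (size for size in file_sizes_mb if size < small_threshold_mb),
        reverse=True,
    )
    groups, current = [], []
    for size in small:
        # Close the current batch when adding this file would exceed the target.
        if current and sum(current) + size > target_mb:
            groups.append(current)
            current = []
        current.append(size)
    if current:
        groups.append(current)
    # Rewriting a single file on its own saves nothing, so drop such batches.
    return [group for group in groups if len(group) > 1]


# Ten 10 MB files left by frequent commits, plus one healthy 500 MB file.
sizes = [10] * 10 + [500]
plan = plan_compaction(sizes)
files_before = len(sizes)
files_after = files_before - sum(len(group) for group in plan) + len(plan)
```

Even this toy version makes the trade-off visible: the shorter the commit interval, the more small files arrive per hour, and the more rewrite work the maintenance schedule has to absorb.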
Questions to ask
- What are the source update and delete semantics?
- Which downstream engines must read the curated table?
- How are schema changes detected, approved, and propagated?
- What compaction and maintenance schedule keeps reads reliable?
- Can lineage connect the source change event to the table snapshot and consumer?
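The lineage question above can be tested with very little machinery. The sketch below assumes each table commit records the range of source offsets it covers. The `CommitRecord` type, its fields, and the snapshot ids are hypothetical and do not come from either format's metadata model.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitRecord:
    snapshot_id: int
    first_offset: int   # first source offset included in the snapshot
    last_offset: int    # last source offset included (inclusive)


def snapshot_for_offset(commits, offset):
    """Find the snapshot that first contained a source offset.

    Assumes commits are sorted by last_offset and their ranges do not overlap.
    Returns None when the offset has not been committed to the table.
    """
    last_offsets = [commit.last_offset for commit in commits]
    index = bisect_left(last_offsets, offset)
    if index < len(commits) and commits[index].first_offset <= offset:
        return commits[index].snapshot_id
    return None


commits = [
    CommitRecord(snapshot_id=101, first_offset=0, last_offset=99),
    CommitRecord(snapshot_id=102, first_offset=100, last_offset=249),
    CommitRecord(snapshot_id=103, first_offset=250, last_offset=260),
]

landed_in = snapshot_for_offset(commits, 100)
not_landed = snapshot_for_offset(commits, 261)
```

If a lookup like this cannot be answered from the metadata the pipeline already keeps, the answer to the lineage question is probably "not yet".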
For related streaming material, read Flink and Iceberg Streaming Patterns, Apache Paimon and the Streaming Lakehouse, and Deletes, Updates, and CDC in Open Table Formats.
Sources to start with
Start with the primary docs. They are the contracts you can test against, not commentary about the contracts.