Open Data Infrastructure
Iceberg vs Parquet: They're Not the Same Thing
A clear, citable comparison of Apache Iceberg vs Apache Parquet, and why confusing table formats with file formats creates lock-in risk.
If someone tells you they are "moving to Iceberg," ask one question before you nod along. Are they changing a table contract, or are they just changing files?
Why it matters
Parquet is a file format. Iceberg is a table format. Parquet defines how columnar data is encoded and laid out inside a single file. Iceberg tells you which files belong to a table, which schema is valid, which snapshots exist, and how engines coordinate change without guessing.
Teams mix these up because both words show up in the same sentence. Iceberg tables often store data in Parquet files. That does not make Parquet a table format, and it does not make Iceberg "the new Parquet."
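To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's actual metadata layout: the names `Snapshot`, `TableMetadata`, and `live_files` are hypothetical, and the file paths are invented. It only illustrates the idea that a table format answers "which files belong to the table right now," which a listing of Parquet files in storage cannot.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One committed table state: an ID plus the data files it references."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class TableMetadata:
    """Toy stand-in for table-format metadata (NOT the real Iceberg layout)."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit(self, snapshot: Snapshot) -> None:
        # Record the new snapshot and make it the current table state.
        self.snapshots[snapshot.snapshot_id] = snapshot
        self.current_snapshot_id = snapshot.snapshot_id

    def live_files(self) -> Set[str]:
        # The table's answer to "which files count?" comes from metadata.
        if self.current_snapshot_id is None:
            return set()
        return set(self.snapshots[self.current_snapshot_id].data_files)


meta = TableMetadata()
meta.commit(Snapshot(1, ("data/a.parquet", "data/b.parquet")))
# A hypothetical compaction rewrites a and b into c.
meta.commit(Snapshot(2, ("data/c.parquet",)))

# What a directory listing would show: old files plus a leftover
# from a failed write. All of them are valid Parquet.
files_in_storage = {
    "data/a.parquet",
    "data/b.parquet",
    "data/c.parquet",
    "data/orphan.parquet",
}

live = meta.live_files()
not_in_table = files_in_storage - live
```

Every file in `files_in_storage` is readable on its own, but only the metadata says that `data/c.parquet` is the table. Reading the whole directory would double-count compacted rows and pick up the orphan.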
If you buy the wrong mental model, you buy the wrong architecture. You judge how much of your data you own by its file format, while the real lock-in hides in catalog APIs, metadata semantics, and operational contracts.
The ODI angle
Open Data Infrastructure (ODI) is not about picking the right file format. It is about making your data contract portable across engines, teams, and vendors.
Iceberg matters because it defines a table-level contract on top of immutable data files. That contract is what lets multiple engines participate without rewriting everything.
Parquet matters because it is a common, open representation of columnar data. It is the material your table format shapes into something reliable.
Core idea: openness is a contract. File openness is not enough if the table contract and catalog boundary are trapped.
The architecture test
For data practitioners, the test is direct. Can another engine join the party without a migration project?
- Decide where the table contract lives (format spec, catalog, engine), then write it down.
- Treat the catalog API as part of the table contract, not a convenience layer.
- Test schema evolution and partition evolution with real workloads.
- Prove multi-engine reads (Spark and Trino, or Flink and DuckDB) before you commit.
- Plan for deletes and updates. If your story stops at append-only, you are not done.
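The schema evolution item in the list above is easier to reason about with a small example. Iceberg tracks columns by field ID rather than by name, which is why a rename does not orphan old data. The sketch below is a toy illustration of that idea in plain Python, not the Iceberg library API: the function names, the schemas, and the row are all hypothetical.

```python
def read_by_field_id(row_by_id, schema):
    """Project a stored row onto a schema, matching columns by field ID."""
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


def read_by_name(row_by_name, schema):
    """Project a stored row onto a schema, matching columns by name only."""
    return {name: row_by_name.get(name) for name in schema.values()}


# Schemas map field ID -> column name (hypothetical columns).
old_schema = {1: "id", 2: "amt"}
# Evolution: field 2 is renamed to "amount", field 3 is added.
new_schema = {1: "id", 2: "amount", 3: "currency"}

# A row written under the old schema, keyed by field ID.
old_row_by_id = {1: 42, 2: 9.5}
# The same row as a name-only reader would see it.
old_row_by_name = {old_schema[fid]: value for fid, value in old_row_by_id.items()}

by_id = read_by_field_id(old_row_by_id, new_schema)
by_name = read_by_name(old_row_by_name, new_schema)
```

With ID-based resolution, `by_id["amount"]` still carries the old value after the rename. With name-based resolution, `by_name["amount"]` comes back empty, because the old file only knows the column as `amt`. A real test should run the same rename against actual engines and actual data.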
What breaks first
Iceberg adoptions usually break first where teams pretend the files are the contract.
- The team says it is Iceberg but still relies on manual file listing and partition guessing.
- You can read the Parquet files, but you cannot reproduce the table state behind a metric.
- A vendor "supports Iceberg" but only through a proprietary catalog boundary.
- Time travel exists in demos, but nobody can roll back a bad write with confidence.
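The last failure mode above comes down to two operational facts: a rollback moves the table's current pointer to an earlier snapshot, and it only works if that snapshot has not been expired. The sketch below is a toy model in plain Python under that assumption. `ToyTable`, `SnapshotExpired`, and the method names are hypothetical, and real engines expose rollback through their own procedures or APIs.

```python
class SnapshotExpired(Exception):
    """Raised when a rollback targets a snapshot that no longer exists."""


class ToyTable:
    """Toy snapshot log: snapshot ID -> data files, plus a current pointer."""

    def __init__(self):
        self.snapshots = {}
        self.current = None

    def commit(self, snapshot_id, files):
        self.snapshots[snapshot_id] = list(files)
        self.current = snapshot_id

    def expire(self, snapshot_id):
        # Retention policy in miniature: never expire the current snapshot.
        if snapshot_id == self.current:
            raise ValueError("cannot expire the current snapshot")
        self.snapshots.pop(snapshot_id, None)

    def rollback_to(self, snapshot_id):
        # Rollback is a metadata change; no data files are rewritten.
        if snapshot_id not in self.snapshots:
            raise SnapshotExpired(f"snapshot {snapshot_id} is not retained")
        self.current = snapshot_id

    def scan(self):
        return list(self.snapshots[self.current])


table = ToyTable()
table.commit(1, ["data/day1.parquet"])
table.commit(2, ["data/day1.parquet", "data/day2.parquet"])
table.commit(3, ["data/day1.parquet", "data/day2.parquet", "data/bad.parquet"])

# Snapshot 3 was a bad write: point the table back at snapshot 2.
table.rollback_to(2)
files_after_rollback = table.scan()

# Snapshot 1 is expired by retention, so it is no longer a rollback target.
table.expire(1)
try:
    table.rollback_to(1)
    rollback_to_expired_failed = False
except SnapshotExpired:
    rollback_to_expired_failed = True
```

The useful part is the pairing: whoever sets snapshot retention also decides how far back a rollback can reach. If nobody on the team can state both numbers, time travel is still a demo feature.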
Questions to ask
Use these questions when you evaluate Iceberg vs Parquet in a real platform decision.
- Which Iceberg catalog are you using, and can a second engine use it without custom glue?
- Which parts of your contract are in the Iceberg spec vs in your platform's catalog semantics?
- Can you explain snapshot retention and the rollback procedure in operational terms?
- How do you handle deletes and updates, including compaction?
- If you export Parquet files, what table semantics do you lose immediately?
If the answer depends on a custom export, a private metadata model, or a single execution engine, the system can still be useful. It just is not as open as the slide says.
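To see what the question about exported Parquet files means in practice, consider row-level deletes. Iceberg's spec allows deletes to be recorded in separate delete files instead of rewriting the data file, so the data file alone can still contain rows the table considers gone. The sketch below is a toy model of position deletes in plain Python. The function names, paths, and rows are hypothetical, and nothing here reads real Parquet.

```python
# Hypothetical data files: path -> rows in stored order.
data_files = {
    "data/a.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
}

# Toy position deletes: (data file path, row position) pairs marked deleted.
position_deletes = {("data/a.parquet", 1)}


def read_table(files, deletes):
    """Read through the table contract: apply deletes while scanning."""
    rows = []
    for path, file_rows in files.items():
        for position, row in enumerate(file_rows):
            if (path, position) not in deletes:
                rows.append(row)
    return rows


def read_raw_files(files):
    """Read the exported files directly: no table metadata, no deletes."""
    return [row for file_rows in files.values() for row in file_rows]


table_ids = [row["id"] for row in read_table(data_files, position_deletes)]
raw_ids = [row["id"] for row in read_raw_files(data_files)]
```

Here `table_ids` omits the deleted row while `raw_ids` still includes it. An export that copies data files without applying deletes silently resurrects data, which is one concrete semantic you lose the moment you step outside the table contract.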
Sources to start with
Start with primary sources. You can argue with opinions. You cannot argue with a spec.