The Role of Apache Avro in Open Data Infrastructure

Not every open data infrastructure problem is a columnar-table problem.

Avro is easy to overlook

Columnar formats get the attention in analytics because scan performance is visible. Parquet became the default mental model for data lakes for a reason. But open data infrastructure (ODI) is not only about analytical scans. It is also about preserving data contracts across systems.

Apache Avro still matters because it keeps the schema next to the data. Records are always written and read against a schema, and an Avro container file stores the writer's schema in its header. That makes it useful in event streams, RPC-style payloads, schema evolution workflows, and some table-format internals.

Core idea: Avro is a contract format for records. Parquet is a workhorse for analytical columns. ODI needs both patterns.

What Avro is good at

Avro schemas are defined in JSON. The specification covers primitive types, complex types (records, enums, arrays, maps, unions, and fixed), logical types, and the schema resolution rules that apply when the reader's schema differs from the writer's. That schema-first design is the point.
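As a rough illustration of what such a schema looks like, the sketch below builds one as a Python dictionary and serializes it to JSON using only the standard library. The record name, namespace, and every field are hypothetical and chosen only to touch each kind of type the specification describes.

```python
import json

# Hypothetical event schema: all names here are illustrative assumptions.
ORDER_EVENT_SCHEMA = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "example.events",
    "fields": [
        # Primitive types
        {"name": "order_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        # Enum: a closed set of symbols
        {"name": "status", "type": {
            "type": "enum", "name": "OrderStatus",
            "symbols": ["NEW", "PAID", "CANCELLED"]}},
        # Array and map
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        {"name": "attributes", "type": {"type": "map", "values": "string"}},
        # Union with null, the usual way to model an optional field
        {"name": "coupon", "type": ["null", "string"], "default": None},
        # Logical type layered on a primitive
        {"name": "placed_at", "type": {
            "type": "long", "logicalType": "timestamp-millis"}},
    ],
}

# The schema itself is plain JSON, so it can be stored, diffed, and reviewed.
schema_json = json.dumps(ORDER_EVENT_SCHEMA, indent=2)
field_names = [field["name"] for field in ORDER_EVENT_SCHEMA["fields"]]
```

Because the schema is ordinary JSON, it can live in version control or a schema registry and be reviewed like any other contract.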

Avro works well when producers and consumers need to exchange records while preserving structure. It is especially common around event streams, schema registries, and systems where the reader and writer may evolve at different speeds.
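One way to see why the schema has to travel with the exchange is to look at the binary encoding. The sketch below hand-encodes a two-field record following the Avro binary encoding rules for long and string values. It is a teaching aid under stated assumptions, not a replacement for an Avro library: the field list and record values are hypothetical, and only these two types are handled.

```python
def encode_long(value: int) -> bytes:
    """Avro int/long: zig-zag encode, then variable-length base-128 bytes."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def encode_string(value: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = value.encode("utf-8")
    return encode_long(len(data)) + data


# Hypothetical field list: (name, encoder) pairs in schema order.
FIELDS = [("order_id", encode_string), ("amount_cents", encode_long)]


def encode_record(record: dict) -> bytes:
    """A record is its field values concatenated in schema order."""
    return b"".join(encoder(record[name]) for name, encoder in FIELDS)


encoded = encode_record({"order_id": "o-1", "amount_cents": 500})
```

The encoded bytes contain no field names and no type tags, only values in schema order. A consumer that lacks the writer's schema cannot tell where one field ends and the next begins, which is why the contract matters as much as the payload.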

Where Avro shows up in the lakehouse

In open lakehouse systems, Avro often appears in less visible places. Iceberg stores its manifest lists and manifest files in Avro. Table formats can also permit Avro data files alongside Parquet and ORC. Trino, for example, documents Avro as one of the data file formats supported by its Iceberg connector.

That does not make Avro a replacement for Parquet. It makes Avro part of the interoperability substrate. The lakehouse needs columnar data files for analytics and row-oriented schemas where records and metadata need a portable representation.

Events are the natural Avro home

Event-driven systems care about schema evolution. A producer adds a field. A consumer still reads with an older schema. Another service expects a logical type. These are not abstract concerns. They decide whether an event stream can change without breaking the downstream estate.

Avro gives teams a precise way to reason about those changes. In ODI terms, that means the event contract can feed open tables without losing its original meaning.
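The reasoning about added fields can be sketched in a few lines. The function below is a deliberately simplified model of how Avro resolves a record between a writer's schema and a reader's schema: fields are matched by name, fields only the writer knows are ignored, and fields only the reader knows take the reader's default. It is not the Avro library's implementation, it skips type promotion, aliases, and nested types, and the schemas and field names are hypothetical.

```python
def resolve_record(writer_fields, reader_fields, written_record):
    """Project a record written with writer_fields onto reader_fields.

    Simplified sketch of Avro record resolution (name matching and
    defaults only; no type promotion, aliases, or nested types).
    """
    writer_names = {field["name"] for field in writer_fields}
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            # Both sides know the field: take the written value.
            resolved[name] = written_record[name]
        elif "default" in field:
            # Reader-only field: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            # Reader-only field without a default cannot be resolved.
            raise ValueError(f"no value or default for reader field {name!r}")
    # Writer-only fields are never copied, so they are silently dropped.
    return resolved


# Hypothetical schema versions: v2 adds a field with a default.
V1_FIELDS = [
    {"name": "order_id", "type": "string"},
    {"name": "amount_cents", "type": "long"},
]
V2_FIELDS = V1_FIELDS + [
    {"name": "currency", "type": "string", "default": "USD"},
]

# New consumer reading old data: the missing field is filled from the default.
upgraded = resolve_record(
    V1_FIELDS, V2_FIELDS, {"order_id": "o-1", "amount_cents": 500})

# Old consumer reading new data: the unknown field is ignored.
downgraded = resolve_record(
    V2_FIELDS, V1_FIELDS,
    {"order_id": "o-2", "amount_cents": 700, "currency": "EUR"})
```

In this sketch the change is safe in both directions only because the new field declares a default. Dropping the default from `currency` would make old data unreadable for the new consumer, which is the kind of breakage a compatibility check is meant to catch before deployment.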

Use Avro where records need to travel

A practical rule works well:

Use Parquet or ORC for analytical table data where columnar scans matter.
Use Avro where record schemas, event payloads, or metadata interchange matter.
Use an open table format to define table-level behavior above the file format.

Confusing those layers creates bad arguments. Parquet's strength in analytical scans does not make Avro obsolete. The two formats do different jobs.

Sources to start with

Use the Avro specification for serialization behavior and table-format docs for where Avro appears inside lakehouse metadata and data files.

