You do not really adopt a table format when you write your first table. You adopt it when you keep that table healthy for three years.

Why maintenance is part of the contract

Table formats like Iceberg make data portable by defining the table contract in metadata. That contract includes evolution, deletes, and time travel. None of that stays cheap if the table turns into a landfill of tiny files, stale snapshots, and orphaned objects.

Core idea: portability without hygiene becomes an operational tax that pushes teams back toward closed platforms.

If you are new to the format layer, start with "What Is a Table Format?" and "Time Travel in Data Systems."

What maintenance actually includes

Maintenance is not one job. It is a set of recurring responsibilities:

  • file compaction: merging small files into healthier sizes for scan efficiency
  • rewrite planning: adjusting file layout for partition changes or clustering strategies
  • snapshot expiration: trimming history based on retention requirements
  • orphan cleanup: removing unreferenced files safely
  • metadata hygiene: keeping manifests, stats, and catalog metadata at a manageable scale

The easiest mistake is treating maintenance as an optimization project. In production, it is a reliability requirement.

Compaction: file size and write shape

Compaction is about undoing the physical consequences of how you write. Streaming jobs that checkpoint frequently, batch jobs that parallelize aggressively, and retries that write duplicates all create small files. Small files are not a moral failing. They are a predictable result of distributed writes.

The practical goal is to keep table scans efficient without breaking the table contract. Compaction decisions should consider:

  • target file size ranges per workload
  • how deletes are represented (delete files vs deletion vectors, when available)
  • how partition evolution affects physical layout
  • how to schedule compaction to avoid interfering with critical reads
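
To make the file-size consideration concrete, here is a minimal sketch of compaction planning for a single partition. It is not how any particular engine plans rewrites. The names DataFile and plan_compaction, the 512 MB target, and the 0.75 "small file" fraction are all assumptions chosen for illustration. It uses simple first-fit-decreasing bin packing and ignores delete files entirely.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry read from table metadata.
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes=512 * MB, small_fraction=0.75, min_inputs=2):
    """Group the small files of one partition into rewrite batches near target_bytes.

    The target size and the small-file fraction are illustrative assumptions,
    not recommendations; tune them per workload.
    """
    threshold = int(target_bytes * small_fraction)
    # Only files below the threshold are candidates; healthy files are left alone.
    small = sorted(
        (f for f in files if f.size_bytes < threshold),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    bins = []  # each bin is [total_bytes, [files]]
    for f in small:
        for b in bins:
            # First fit: place the file in the first batch that still has room.
            if b[0] + f.size_bytes <= target_bytes:
                b[0] += f.size_bytes
                b[1].append(f)
                break
        else:
            bins.append([f.size_bytes, [f]])
    # A batch with one input would rewrite a file without merging anything.
    return [group for _, group in bins if len(group) >= min_inputs]
```

A plan like this only decides what to rewrite. The rewrite itself still has to be committed through the table format so that readers see either the old files or the new ones, never a mix.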

Snapshot expiration and retention

Time travel is useful until it is expensive. Snapshot retention should be a policy decision, not an accident.

Define retention for:

  • compliance and audit requirements
  • debugging needs (how far back do you need to reproduce incidents?)
  • cost constraints (every retained snapshot keeps its metadata and the data files it references in storage)

If you have not defined retention, you have already defined it. You defined it as "keep everything until something breaks."
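
One way to turn retention into an explicit policy is to write it down as a function. The sketch below assumes a policy with three knobs: a maximum age, a minimum number of recent snapshots to keep regardless of age, and a set of protected snapshot IDs (for example, ones pinned for an audit). Snapshot, snapshots_to_expire, and all parameter names are hypothetical. A real table format applies its own rules, such as keeping snapshots referenced by branches or tags.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a snapshot entry in table metadata.
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots, now, max_age, min_to_keep=1, protected_ids=frozenset()):
    """Return IDs of snapshots that the assumed retention policy allows expiring."""
    newest_first = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    cutoff = now - max_age
    expire = []
    for index, snap in enumerate(newest_first):
        if index < min_to_keep:
            continue  # always keep the newest N, even if they are old
        if snap.snapshot_id in protected_ids:
            continue  # pinned for compliance or debugging
        if snap.committed_at < cutoff:
            expire.append(snap.snapshot_id)
    return expire
```

Usage might look like this, with a 7-day window and the two newest snapshots always kept:

```python
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
history = [
    Snapshot(1, datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Snapshot(2, datetime(2024, 1, 10, tzinfo=timezone.utc)),
    Snapshot(3, datetime(2024, 1, 20, tzinfo=timezone.utc)),
    Snapshot(4, datetime(2024, 1, 30, tzinfo=timezone.utc)),
]
expired = snapshots_to_expire(history, now, timedelta(days=7), min_to_keep=2)
```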

Orphan file cleanup and correctness

Orphan files happen. Failed jobs write files that never get committed. Jobs get aborted. Temporary staging paths get left behind. Cleaning up orphans is important, but it is dangerous without a safe procedure: a file written by a job that has not committed yet looks exactly like an orphan.

Do not treat orphan cleanup as "delete anything old." Treat it as a controlled operation tied to the table format's understanding of what is referenced.
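
The shape of a controlled cleanup can be sketched as a set difference with a safety window. This is an illustration of the idea, not a replacement for the procedure your format or engine provides. StoredObject, orphan_candidates, and the three-day default window are assumptions, and the sketch only produces candidates for review rather than deleting anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredObject:
    # Hypothetical stand-in for one entry from an object-store listing.
    path: str
    last_modified: datetime


def orphan_candidates(listed, referenced_paths, now, safety_window=timedelta(days=3)):
    """Return paths that are unreferenced AND older than the safety window.

    referenced_paths must come from the table format's own metadata, and both
    sides must use the same normalized path form or live files will look orphaned.
    """
    cutoff = now - safety_window
    referenced = set(referenced_paths)
    return sorted(
        obj.path
        for obj in listed
        # The age check protects files from in-flight writes that have not committed yet.
        if obj.path not in referenced and obj.last_modified < cutoff
    )
```

Returning candidates instead of deleting them keeps a dry-run step in the loop, which is a reasonable default for an operation that cannot be undone.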

How to automate it without chaos

Automation is the difference between healthy tables and heroic weekends. A practical approach:

  • run compaction and expiration as scheduled jobs with clear ownership
  • emit metrics and alerts when maintenance drifts (file counts, snapshot counts, metadata size)
  • create guardrails per table class (high-churn vs append-only)
  • treat maintenance changes like releases (test, canary, then expand)

If you run multiple engines, test maintenance behaviors across them. Portability fails when one engine relies on behavior another engine does not implement.
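
As a rough illustration of guardrails per table class, the sketch below compares a table's current metrics against limits for its class and returns alert messages when maintenance has drifted. The class names, the Guardrail fields, and every threshold value are placeholders, not recommendations. How you collect the metrics depends on your catalog and engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    max_data_files: int
    max_snapshots: int
    max_metadata_bytes: int


# Placeholder limits per table class; real values should come from observed workloads.
GUARDRAILS = {
    "high_churn": Guardrail(max_data_files=50_000, max_snapshots=500, max_metadata_bytes=2 * 1024**3),
    "append_only": Guardrail(max_data_files=200_000, max_snapshots=100, max_metadata_bytes=1 * 1024**3),
}


def maintenance_alerts(table_class, metrics):
    """Return one message per metric that exceeds the limit for the table's class."""
    limits = GUARDRAILS[table_class]
    checks = [
        ("data_files", limits.max_data_files),
        ("snapshots", limits.max_snapshots),
        ("metadata_bytes", limits.max_metadata_bytes),
    ]
    return [
        f"{name}={metrics[name]} exceeds limit {limit}"
        for name, limit in checks
        if metrics[name] > limit
    ]
```

An empty result means the table is within its guardrails. Anything else is a signal that a scheduled job failed, fell behind, or was never configured for that table.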

Maintenance checklist

  • Defined target file sizes and compaction cadence per workload
  • Defined snapshot retention policy with explicit owners
  • Automated orphan cleanup with conservative safety windows
  • Metrics for file counts, snapshot counts, and metadata growth
  • Runbooks for maintenance failures and rollbacks

The goal is boring tables. Boring tables keep the platform open.

Sources to start with

Start with the format documentation, then validate procedures in the engine you actually operate.