A REST catalog outage is not a metadata inconvenience. It is the moment every engine asks who is allowed to touch which table, and nobody enjoys improvising that answer at 3 a.m.

The practical problem

The Iceberg REST catalog protocol gives engines a common API boundary for table discovery, namespace operations, commits, and metadata access. That is the correct direction for Open Data Infrastructure. It also concentrates operational risk in one service, because every engine now depends on the same endpoint to find and commit to tables.

If the catalog is slow, unavailable, misconfigured, or handing out the wrong storage credentials, compute engines can stop working even while the underlying table data still exists. The runbook has to cover the catalog as a control plane, not a helper service.

Core idea: treat the REST catalog like production identity infrastructure for tables. The runbook should name the decision, the owner, the evidence, and the rollback path before the incident starts.
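One way to make that concrete is to store each runbook as a small record and refuse to call it done while any of the four fields is empty. The following is a minimal sketch; the RunbookEntry type, its field names, and the example values are all hypothetical, not part of any Iceberg tooling.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunbookEntry:
    """One catalog runbook, reduced to the four things it has to name."""
    name: str
    decision: str       # what the on-call operator is deciding
    owner: str          # team or rotation that makes the call
    evidence: str       # alert, dashboard, or log query that supports the call
    rollback_path: str  # how to undo the action if it was the wrong one


def missing_fields(entry: RunbookEntry) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


# Hypothetical draft: the evidence has not been written down yet.
draft = RunbookEntry(
    name="catalog-auth-failure",
    decision="Rotate the engine token or fail over the identity provider?",
    owner="data-platform-oncall",
    evidence="",
    rollback_path="Re-issue the previous client credential",
)
print(missing_fields(draft))  # ['evidence']
```

A check like this can run in CI against the runbook repository, so an incomplete entry is caught in review rather than during the incident.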

The runbooks that matter

Start with authentication failures. Operators need to know which identity provider, token issuer, catalog role, storage credential, and engine configuration are involved. If every symptom becomes "catalog access denied," the team has not instrumented the boundary.
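A first step toward instrumenting the boundary is to stop collapsing every failure into one message. The sketch below routes a failure to the boundary worth checking first, based on where the error came from and its HTTP status. The 401, 403, and 419 codes follow the error responses described in the REST catalog OpenAPI spec; the mapping from those codes to boundaries, the source labels, and the Boundary names are assumptions of this sketch, and your identity setup may call for a different split.

```python
from enum import Enum


class Boundary(Enum):
    IDENTITY = "identity provider or token issuer"
    CATALOG_ROLE = "catalog role or grant"
    CATALOG_SERVICE = "catalog service health"
    STORAGE = "storage credential"
    UNKNOWN = "unclassified"


def classify_failure(source: str, status: int) -> Boundary:
    """Map (error source, HTTP status) to the boundary to investigate first.

    source is "catalog" or "object_store"; both labels are hypothetical.
    """
    if source == "object_store" and status in (401, 403):
        # The catalog may have approved the operation, but storage refused it.
        return Boundary.STORAGE
    if source == "catalog":
        if status in (401, 419):
            # Missing, invalid, or timed-out credentials.
            return Boundary.IDENTITY
        if status == 403:
            # Authenticated, but not allowed to perform this operation.
            return Boundary.CATALOG_ROLE
        if status >= 500:
            return Boundary.CATALOG_SERVICE
    return Boundary.UNKNOWN


print(classify_failure("catalog", 419).value)
print(classify_failure("object_store", 403).value)
```

Emitting the boundary as a label on the alert gives the on-call operator a starting point instead of a generic "access denied."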

Namespace drift deserves its own runbook. A namespace rename, environment promotion, or catalog migration can break table discovery without corrupting data. The recovery path is usually a mapping and ownership problem before it is a storage problem.
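Drift is easier to triage when it is reported as a diff rather than discovered as "table not found." A minimal sketch, assuming you can obtain the expected namespaces from configuration and the actual ones from the catalog's namespace listing; the function name and the example namespaces are hypothetical. Namespaces are modeled as tuples because Iceberg namespaces can have multiple levels.

```python
Namespace = tuple[str, ...]


def namespace_drift(expected: set[Namespace], actual: set[Namespace]) -> dict[str, list[str]]:
    """Report namespaces that engines expect but the catalog lacks, and the reverse."""
    return {
        "missing": sorted(".".join(ns) for ns in expected - actual),
        "unexpected": sorted(".".join(ns) for ns in actual - expected),
    }


# Hypothetical example: a promotion renamed "prod" to "production".
expected = {("analytics", "prod"), ("raw",)}
actual = {("analytics", "production"), ("raw",)}
print(namespace_drift(expected, actual))
# {'missing': ['analytics.prod'], 'unexpected': ['analytics.production']}
```

A paired "missing" and "unexpected" entry is usually the signature of a rename, which points the operator at the mapping and its owner rather than at storage.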

Commit conflict handling belongs in writing. An Iceberg commit atomically swaps the table's current metadata pointer, so concurrent writers that start from the same table version can collide, and the loser has to refresh and try again. Operators need a policy for retry, backoff, write isolation, and escalation before teams start manually editing metadata pointers. Manual metadata surgery is where brave people become cautionary tales.
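A written policy can be as small as a bounded retry loop with jittered exponential backoff and an explicit escalation point. The sketch below is illustrative only: CommitConflict and EscalateToOwner are hypothetical stand-ins for whatever your engine raises and however your team pages, and the default attempt count and delays are placeholders, not recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CommitConflict(Exception):
    """Hypothetical stand-in for the engine's commit-conflict error."""


class EscalateToOwner(Exception):
    """Raised when retries are exhausted and a human has to decide."""


def commit_with_backoff(
    attempt_commit: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a commit on conflict, with full jitter, then escalate.

    attempt_commit must refresh table state and reapply the change on every
    call; retrying against stale metadata would simply conflict again.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_commit()
        except CommitConflict as exc:
            if attempt == max_attempts:
                raise EscalateToOwner(
                    f"commit still conflicting after {attempt} attempts"
                ) from exc
            # Exponential ceiling, capped, with a random delay below it so
            # competing writers do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only the conflict case is retried here. Any other error propagates unchanged, which matters for failures that leave the commit state unknown: those should go to a human or a verification step, not back into the loop.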

What breaks first

  • Broad storage credentials: the catalog approves a table operation, but the engine receives more object store access than the operation requires.
  • Silent client drift: one engine uses the REST catalog path while another uses a legacy catalog integration with different behavior.
  • Unowned catalog database: the table data is durable, but the catalog state that points to it has no tested restore process.
  • Rollback confusion: teams know Iceberg supports snapshots, but they do not know whether the incident calls for snapshot rollback, catalog restore, credential rotation, or application retry.
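The first item on that list can be checked mechanically. As a rough sketch, assuming you can extract a storage path prefix from each credential the catalog hands out, the function below flags any credential whose prefix reaches beyond the table's own location. The function name and the example paths are hypothetical.

```python
def covers_only_table(table_location: str, credential_prefix: str) -> bool:
    """True if the credential prefix is the table location or a path beneath it."""
    # Normalize to a trailing slash so ".../orders" does not match ".../orders_v2".
    table = table_location.rstrip("/") + "/"
    prefix = credential_prefix.rstrip("/") + "/"
    return prefix.startswith(table)


table = "s3://lake/warehouse/sales/orders"
print(covers_only_table(table, "s3://lake/warehouse/sales/orders/data"))  # True
print(covers_only_table(table, "s3://lake/warehouse"))                    # False
print(covers_only_table(table, "s3://lake/warehouse/sales/orders_v2"))    # False
```

A string-prefix test is deliberately crude. It says nothing about which actions the credential permits or how long it lives, so treat it as an alerting signal rather than as proof of least privilege.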

Questions to ask

Runbooks should answer operational questions in advance.

  • Which alerts prove the failure is catalog-side rather than engine-side?
  • Which identities and roles can perform namespace, table, and commit operations?
  • How are storage credentials scoped, rotated, and revoked?
  • Where is catalog state backed up, and how often is restore tested?
  • Which operations require snapshot rollback versus catalog restore?

For adjacent design, read "The Open Data Infrastructure Stack," "Catalogs as the Control Plane for ODI," and "Disaster Recovery and Backup for the Open Lakehouse."

Sources to start with

Start with the protocol and operations docs, then test the runbooks against your engines.