Open Data Infrastructure
Managed vs Self-Hosted Open Catalog: How to Choose
The catalog is a control plane. Choosing between managed and self-hosted is really a decision about who owns identity, policy, and recovery when something breaks.
The catalog is not a metadata database. It is where your lakehouse decides who can do what, and when.
Why catalogs are different
Teams often treat catalogs as passive registries. That view holds only until you have multiple engines, multiple teams, and a real governance model. At that point the catalog becomes the system that coordinates table identity, credentials, namespaces, policies, and table operations.
This is why "managed vs self-hosted" matters more for catalogs than it does for many other components. You are choosing who owns the control plane.
Core idea: if the catalog is where policy and operations live, then ownership of the catalog is ownership of the boundary.
What you outsource when it is managed
When the catalog is managed, you usually outsource:
- availability: uptime, failover, scaling, and operational patching
- security operations: key rotation, endpoint hardening, vulnerability response
- credential vending complexity: integration patterns for multiple engines and clouds
- some audit surface: log retention and access pattern visibility
You do not outsource the consequences. When an incident happens, the blast radius is still yours. The difference is whether you can change the boundary quickly or you have to wait for a vendor process.
What you own when it is self-hosted
When you self-host, you own all of the operational responsibilities that a managed service would otherwise take on. You also gain two forms of control that matter in Open Data Infrastructure (ODI) terms:
- control of the policy model: you can integrate it with your identity system and your governance approach without inheriting a vendor's opinionated boundary
- control of the migration path: you can move catalog deployments, operators, and supporting systems without negotiating a hosted boundary
Self-hosting is not "free." It is a bet that catalog ownership is strategically valuable. For many buyers, it is.
Failure modes to plan for
Catalogs fail in ways that look boring until the day they are existential:
- auth drift: engines that used to connect stop connecting because of a credential rotation or a policy change
- namespace sprawl: the catalog becomes ungovernable because tenancy conventions were never defined
- policy fragmentation: the catalog enforces one model, the warehouse enforces another, and teams build shadow access paths
- recovery ambiguity: a rollback or delete goes wrong and nobody can prove what the canonical state should be
Those are not edge cases. They are the normal shape of operating a real platform.
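Namespace sprawl is the easiest of these to catch mechanically, because a tenancy convention can be checked in code once it is written down. The sketch below assumes a hypothetical three-level convention of `team.domain.env`. The convention, the allowed environment names, and the function names are illustrative assumptions, not part of any catalog's API.

```python
import re

# Hypothetical convention: <team>.<domain>.<env>, lowercase, env from a fixed set.
ALLOWED_ENVS = {"dev", "staging", "prod"}
SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def namespace_violations(namespace: str) -> list[str]:
    """Return the reasons a namespace breaks the convention (empty if it conforms)."""
    parts = namespace.split(".")
    if len(parts) != 3:
        return [f"expected 3 levels (team.domain.env), got {len(parts)}"]
    problems = []
    for label, value in zip(("team", "domain", "env"), parts):
        if not SEGMENT.fullmatch(value):
            problems.append(f"{label} segment {value!r} is not lowercase snake_case")
    if parts[2] not in ALLOWED_ENVS:
        problems.append(f"env {parts[2]!r} is not one of {sorted(ALLOWED_ENVS)}")
    return problems


def audit(namespaces: list[str]) -> dict[str, list[str]]:
    """Map each non-conforming namespace to its list of violations."""
    report = {}
    for ns in namespaces:
        problems = namespace_violations(ns)
        if problems:
            report[ns] = problems
    return report
```

Running `audit` over the namespace listing from the catalog on a schedule turns "conventions were never defined" into a report someone can act on. The same pattern applies whether the catalog is managed or self-hosted, since it only needs read access to the namespace list.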
A decision framework
This framework tends to produce a sane decision:
- Start with your blast radius: how many engines, teams, and business-critical workflows depend on the catalog boundary?
- Decide what you must control: identity integration, policy evaluation model, audit requirements, and disaster recovery posture.
- Pick a protocol story: for Iceberg catalogs, a REST-compatible interface is the baseline interoperability requirement.
- Pick an operational story: managed can be correct if it still gives you a credible exit and a credible audit trail.
If you are not sure what "protocol story" means, start with "What Is a REST Catalog?"
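To make "REST-compatible" concrete, the sketch below builds the URLs an engine would request for a handful of read operations. The path shapes follow the Iceberg REST catalog OpenAPI specification as I understand it (a `/v1/config` endpoint, an optional prefix segment, and multi-level namespaces joined with the unit separator character); check them against the spec version your catalog implements. The host name, prefix, and the `rest_paths` helper are hypothetical.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the unit separator (0x1F) in URL paths.
UNIT_SEPARATOR = "\x1f"


def rest_paths(base_url: str, namespace: list[str], table: str, prefix: str = "") -> dict[str, str]:
    """Build example read-only endpoint URLs for an Iceberg-style REST catalog."""
    root = base_url.rstrip("/") + "/v1"
    # The prefix is optional; when present it scopes every path except /v1/config.
    scoped = f"{root}/{quote(prefix)}" if prefix else root
    ns = quote(UNIT_SEPARATOR.join(namespace), safe="")
    return {
        "config": f"{root}/config",
        "namespaces": f"{scoped}/namespaces",
        "tables": f"{scoped}/namespaces/{ns}/tables",
        "table": f"{scoped}/namespaces/{ns}/tables/{quote(table, safe='')}",
    }


# Hypothetical catalog host and names, for illustration only.
paths = rest_paths("https://catalog.example.com/", ["payments", "ledger"], "invoices", prefix="main")
for name, url in paths.items():
    print(f"{name}: {url}")
```

If a managed and a self-hosted option both answer these requests the same way, engines can switch between them by changing a base URL, which is what makes the exit credible.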
Sources to start with
Use the protocol docs and the project docs to anchor your evaluation in contracts, not claims.