Vector Search on Open Table Formats

If your vector search stack cannot survive an engine change, you do not have an AI feature. You have a new kind of lock-in.

The portability trap in vector search

Vector search is often shipped as a warehouse feature or a managed service. It looks convenient until you realize the index is the product. The data is not only the embedding array. It is also the mapping from each embedding back to a governed, auditable record.

When you cannot move that mapping and rebuild indexes independently, you do not own your retrieval layer.

Core idea: retrieval is a data contract problem, not a database feature checklist.

What belongs in open tables

Open table formats are a good home for the durable artifacts:

Embeddings: vectors plus the primary key that maps back to the source record.
Chunk metadata: document IDs, chunk boundaries, and normalization decisions.
Provenance: which model produced the embedding, which version, and when.
Permissions signals: enough policy metadata to enforce access at query time.
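
As an illustration of the four artifacts above, here is a minimal sketch of what one row of such a table might carry, written as a plain Python dataclass rather than a real Iceberg schema. Every field name (source_record_id, allowed_groups, and so on) and the validate helper are assumptions made for this example, not part of any table format or library.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EmbeddingRow:
    # Primary key that maps the vector back to the governed source record.
    source_record_id: str
    # Chunk metadata: which document, where the chunk starts and ends,
    # and which normalization was applied before embedding.
    document_id: str
    chunk_start: int
    chunk_end: int
    normalization: str  # hypothetical label, e.g. "strip_html"
    # The embedding itself.
    vector: tuple[float, ...]
    # Provenance: which model, which version, and when.
    model_name: str
    model_version: str
    embedded_at: datetime
    # Permission signal: groups allowed to retrieve this chunk.
    allowed_groups: frozenset[str]


def validate(row: EmbeddingRow) -> list[str]:
    """Return contract violations; an empty list means the row is acceptable."""
    problems = []
    if not row.source_record_id:
        problems.append("missing source_record_id")
    if not row.vector:
        problems.append("empty vector")
    if row.chunk_end <= row.chunk_start:
        problems.append("invalid chunk boundaries")
    if not row.model_name or not row.model_version:
        problems.append("missing provenance")
    if not row.allowed_groups:
        problems.append("missing permission signal")
    return problems
```

A check like this could run before rows are committed, so that an embedding without lineage or a permission signal never reaches the table.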

If you store embeddings as a sidecar dataset with no lineage and no permission mapping, you have built the fastest possible path to an incident.

Where vector indexes should live

Vector indexes are derived artifacts. They can be built in different systems, but you should treat them like you treat materialized views or search indexes: reproducible, rebuildable, and disposable.

That means two things in practice:

Rebuild from open tables: you should be able to regenerate an index from Iceberg tables without special exports.
Index versioning: treat index builds as releases, with explicit versions and rollback plans.
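
The two practices above can be sketched in a few lines. This is an illustrative example under stated assumptions: rows are (record id, vector) pairs already read from the open table, a brute-force cosine search stands in for a real approximate nearest neighbor index, and build_index, search, and IndexRegistry are hypothetical names invented here.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexBuild:
    version: str        # derived from the inputs, so equal inputs give equal versions
    model_version: str
    entries: tuple      # ((record_id, unit-normalized vector), ...)


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return tuple(x / norm for x in vec) if norm else tuple(vec)


def build_index(rows, model_version):
    """Rebuild an index from (record_id, vector) rows read from the open table."""
    ordered = sorted(rows, key=lambda r: r[0])  # deterministic order
    digest = hashlib.sha256(model_version.encode())
    for record_id, vec in ordered:
        digest.update(record_id.encode())
        digest.update(repr(tuple(vec)).encode())
    entries = tuple((rid, _normalize(vec)) for rid, vec in ordered)
    return IndexBuild(digest.hexdigest()[:12], model_version, entries)


def search(index, query, k=3):
    """Brute-force cosine similarity over the build's entries."""
    q = _normalize(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), rid) for rid, vec in index.entries]
    return [rid for _, rid in sorted(scored, reverse=True)[:k]]


class IndexRegistry:
    """Keeps every released build so a bad release can be rolled back."""

    def __init__(self):
        self._builds = []

    def release(self, build):
        self._builds.append(build)

    @property
    def current(self):
        return self._builds[-1]

    def rollback(self):
        if len(self._builds) < 2:
            raise RuntimeError("no earlier build to roll back to")
        self._builds.pop()
        return self.current
```

Because the version is a hash of the model version and the table contents, rebuilding from the same rows yields the same version, which is one simple way to confirm that an index really is reproducible from the table alone.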

The open table layer gives you the stable substrate. Engines and index implementations can change.

Governance and audit for retrieval

Retrieval is an access path. If you do not govern it, you have built a backdoor.

Policy alignment: retrieval results must respect the same permissions as direct queries.
Audit trails: record what was retrieved, why, and which data it was derived from.
Drift detection: embeddings and indexes drift as your corpus changes. Monitor them the way you monitor schema changes.
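
To make policy alignment and audit trails concrete, here is a minimal sketch that filters by permission before ranking and then logs what was returned. It assumes rows are dictionaries carrying an allowed_groups set and a model_version, and that the audit log is an in-memory list. The names governed_search and why_seen are hypothetical, and a real system would enforce policy through its catalog or engine rather than in application code.

```python
import math
from datetime import datetime, timezone


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def governed_search(rows, query_vec, user, user_groups, audit_log, index_version, k=3):
    """Filter by permission before ranking, then record what was returned."""
    # Policy alignment: drop rows the caller could not read through a direct query.
    visible = [r for r in rows if r["allowed_groups"] & user_groups]
    ranked = sorted(visible, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)[:k]
    # Audit trail: what was retrieved, for whom, and from which build and model.
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "index_version": index_version,
        "results": [
            {
                "source_record_id": r["source_record_id"],
                "model_version": r["model_version"],
                "score": round(cosine(query_vec, r["vector"]), 4),
            }
            for r in ranked
        ],
    })
    return [r["source_record_id"] for r in ranked]


def why_seen(audit_log, source_record_id):
    """Return the audit entries in which a given record was retrieved."""
    return [
        entry for entry in audit_log
        if any(hit["source_record_id"] == source_record_id for hit in entry["results"])
    ]
```

Filtering before ranking matters: if restricted rows were ranked first and dropped afterward, a caller could receive fewer than k results and infer that hidden content exists.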

If you cannot answer "why did the model see this chunk," you are not ready for production retrieval.

Sources to start with

Start with the open table contract and metadata surfaces, then design retrieval as a governed access path.


Get started with Apache Iceberg today. Want to learn more? Visit https://www.opendatainfrastructure.com/