Data Modeling for Entity-Centric Retrieval

Vector search is much less impressive when it retrieves the right paragraph for the wrong entity.

The practical problem

Retrieval systems often start with documents. Chunk the text, embed the chunks, query the index, and hope similarity finds the answer. That pattern breaks down when the user is asking about a customer, product, incident, invoice, account, or policy-bound object.

Entity-centric retrieval starts with the business object. What is the entity? What is the grain? Which identifiers are stable? Which relationships matter? Which facts are current? Which facts are allowed for this user and purpose?

The entity model is the retrieval model

A useful entity model defines keys, relationships, time boundaries, ownership, policy, and source authority. That model tells retrieval which context belongs together and which context should stay apart.
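One way to picture such a model is as a small set of typed records. The sketch below is illustrative only: the names `EntityFact`, `Relationship`, and `current_facts`, the field names, and the sample account data are all assumptions made for this example, not part of any particular product or schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EntityFact:
    """One fact about one entity, carrying the metadata retrieval needs."""
    entity_id: str             # stable key that survives across systems
    entity_type: str           # grain, e.g. "account" vs "user"
    attribute: str
    value: str
    valid_from: date           # time boundary: inclusive start
    valid_to: Optional[date]   # exclusive end; None means still current
    owner: str                 # owning team or domain
    policy_tag: str            # e.g. "public", "internal", "restricted"
    source: str                # system treated as authoritative for this fact

    def is_current(self, as_of: date) -> bool:
        ended = self.valid_to is not None and as_of >= self.valid_to
        return self.valid_from <= as_of and not ended


@dataclass(frozen=True)
class Relationship:
    """A named, directed link between two entities."""
    from_id: str
    to_id: str
    kind: str                  # e.g. "user_belongs_to_account"


def current_facts(facts: List[EntityFact], entity_id: str, as_of: date) -> List[EntityFact]:
    """Return only the facts for one entity that are valid on the given date."""
    return [f for f in facts if f.entity_id == entity_id and f.is_current(as_of)]


# Hypothetical sample data: one account whose plan changed, and one user.
FACTS = [
    EntityFact("acct-001", "account", "plan", "basic",
               date(2023, 1, 1), date(2024, 1, 1), "billing", "internal", "crm"),
    EntityFact("acct-001", "account", "plan", "enterprise",
               date(2024, 1, 1), None, "billing", "internal", "crm"),
    EntityFact("user-042", "user", "role", "admin",
               date(2023, 6, 1), None, "identity", "restricted", "idp"),
]
LINKS = [Relationship("user-042", "acct-001", "user_belongs_to_account")]
```

With this shape, asking for `current_facts(FACTS, "acct-001", date(2024, 6, 1))` returns only the enterprise plan, and the user-level fact stays out of the result unless a relationship in `LINKS` is followed on purpose.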

Without that model, retrieval can mix account-level and user-level facts, old and current records, public and restricted documents, or similar names from unrelated entities. The index may be technically healthy while the answer is semantically wrong.

Core idea: retrieval quality depends on entity modeling before it depends on vector tuning.

Structured context and vector context should cooperate

Vector search is still useful. It helps find language-level matches, examples, policies, and documents. But the retrieval path should use structured entity context to filter, join, rank, and explain what was retrieved.
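As a rough sketch of that cooperation, the example below filters chunks by a stable entity ID first and only then ranks the survivors by cosine similarity. Everything here is assumed for illustration: the `CHUNKS` data, the two-dimensional toy embeddings, and the `retrieve` function are hypothetical, and a real system would push the filter down into its vector store rather than scan a list.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical index: every chunk carries the entity ID it belongs to.
CHUNKS = [
    {"chunk_id": "c1", "entity_id": "acct-001", "embedding": [0.9, 0.1]},
    {"chunk_id": "c2", "entity_id": "acct-002", "embedding": [1.0, 0.0]},
    {"chunk_id": "c3", "entity_id": "acct-001", "embedding": [0.2, 0.8]},
]


def retrieve(query_embedding: List[float], entity_id: str,
             chunks: List[Dict], k: int = 2) -> List[Dict]:
    """Filter by entity first, then rank the remaining chunks by similarity."""
    candidates = [c for c in chunks if c["entity_id"] == entity_id]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_embedding, c["embedding"]),
        reverse=True,
    )
    # Attach a short explanation so the caller can say why each chunk qualified.
    return [
        {
            "chunk_id": c["chunk_id"],
            "entity_id": c["entity_id"],
            "score": cosine(query_embedding, c["embedding"]),
            "reason": f"entity_id == {entity_id}",
        }
        for c in ranked[:k]
    ]


results = retrieve([1.0, 0.0], "acct-001", CHUNKS)
```

In this toy data, chunk `c2` is the closest vector match to the query, but it belongs to a different account, so it never enters the candidate set. That is the "right paragraph, wrong entity" failure being filtered out by structure rather than by similarity tuning.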

This is why data modeling for retrieval-augmented generation (RAG) and structured retrieval is an Open Data Infrastructure (ODI) topic. The open data layer should expose meaning, policy, lineage, and freshness alongside text similarity.

What breaks first

Chunks mention an entity name but lack the stable entity ID.
Retrieved results mix current facts with historical facts without time labels.
Policy is applied after ranking, so restricted content has already shaped the candidate set.
The agent cannot explain which entity relationship justified the answer.
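Two of these failures, missing time labels and policy applied too late, can be addressed in the step that builds the candidate set. The sketch below is a minimal illustration under stated assumptions: the record fields (`policy_tag`, `valid_from`, `valid_to`), the clearance set, and the function names are hypothetical, and real policy engines are usually far richer than a tag lookup.

```python
from datetime import date
from typing import Dict, List, Set


def time_label(record: Dict, as_of: date) -> str:
    """Label a record relative to a reference date."""
    if as_of < record["valid_from"]:
        return "future"
    if record["valid_to"] is None or as_of < record["valid_to"]:
        return "current"
    return "historical"


def build_candidates(records: List[Dict], clearances: Set[str], as_of: date) -> List[Dict]:
    """Drop disallowed records BEFORE ranking and tag the rest with a time label."""
    candidates = []
    for record in records:
        if record["policy_tag"] not in clearances:
            continue  # restricted content never reaches the ranker
        candidates.append({**record, "time_label": time_label(record, as_of)})
    return candidates


# Hypothetical records for a single entity.
RECORDS = [
    {"record_id": "r1", "policy_tag": "public",
     "valid_from": date(2022, 1, 1), "valid_to": date(2023, 1, 1)},
    {"record_id": "r2", "policy_tag": "internal",
     "valid_from": date(2023, 1, 1), "valid_to": None},
    {"record_id": "r3", "policy_tag": "restricted",
     "valid_from": date(2023, 1, 1), "valid_to": None},
]

candidates = build_candidates(RECORDS, {"public", "internal"}, date(2024, 3, 1))
```

Because the policy check runs before any scoring, a restricted record cannot influence which results are returned or how they are ordered, and every surviving record says whether it describes the present or the past.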

Questions to ask

What is the grain of the retrieved object?
Which identifier survives across systems?
Which relationships are allowed to expand retrieval?
Can the answer cite source entity, source version, and policy status?
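The last question can be turned into a simple check: a retrieved record is only usable if it carries enough provenance to be cited. The following sketch assumes a hypothetical `Citation` shape and field names (`source_version`, `policy_status`, and so on) chosen for this example; it is one possible convention, not a standard.

```python
from dataclasses import dataclass
from typing import Dict

# Fields a record must carry before an answer is allowed to cite it.
REQUIRED_FIELDS = ("entity_id", "entity_type", "source", "source_version", "policy_status")


@dataclass(frozen=True)
class Citation:
    entity_id: str
    entity_type: str
    source: str
    source_version: str
    policy_status: str


def cite(record: Dict[str, str]) -> Citation:
    """Build a citation, refusing records that lack provenance fields."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError(f"cannot cite record, missing: {', '.join(missing)}")
    return Citation(**{name: record[name] for name in REQUIRED_FIELDS})


def render(citation: Citation) -> str:
    """Format a citation as a single human-readable line."""
    return (
        f"{citation.entity_type}:{citation.entity_id} "
        f"from {citation.source}@{citation.source_version} "
        f"[policy: {citation.policy_status}]"
    )


# Hypothetical retrieved record for an invoice entity.
record = {
    "entity_id": "inv-7781",
    "entity_type": "invoice",
    "source": "billing_db",
    "source_version": "v12",
    "policy_status": "allowed",
    "text": "Invoice total and due date.",
}
line = render(cite(record))
```

Failing loudly on a record without an entity ID or source version makes the gap visible at retrieval time, instead of letting an uncitable fact reach the answer.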

Retrieval gets smarter when the data model stops being invisible.
