Open Data Infrastructure
Data Modeling for RAG and Structured Retrieval
Why retrieval quality depends on entity design, grain, relationships, permissions, freshness, and structured context.
Vector search can find similar text. It cannot fix a data model that hides the entity, the grain, the permission, or the freshness of the thing being retrieved.
The practical problem
RAG systems often begin with chunking, embeddings, and a vector index. Those pieces matter. They are not a substitute for data modeling.
Structured retrieval works when the system understands entities, relationships, time, ownership, permissions, and source meaning. Without that structure, retrieval may return plausible context that is wrong for the user, the task, or the moment.
Core idea: RAG quality depends on modeling the context an answer needs, not only indexing the text an embedding can find.
The modeling work
Start with entity design. Customer, account, product, asset, policy, claim, contract, and incident are not interchangeable chunks. The retrieval system should know which entity is being discussed and how it relates to other entities.
Then define grain. A paragraph, record, event, table row, document, and metric can all describe the same business concept at different grains. Retrieval without grain awareness mixes evidence, for example by citing a monthly metric next to a single event as if both measured the same thing.
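One way to make entity and grain explicit is to attach them to every retrievable chunk as metadata. The sketch below is a minimal illustration, not a prescribed schema: the `Chunk` class, its field names, and the example labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One retrievable unit, tagged with the entity and grain it describes."""
    text: str
    entity_type: str  # hypothetical label, e.g. "contract" or "policy"
    entity_id: str    # identifier of the specific entity
    grain: str        # hypothetical label, e.g. "document", "paragraph", "event"


def group_by_entity_and_grain(chunks):
    """Bucket chunks so evidence from different entities or grains stays separate."""
    groups = defaultdict(list)
    for chunk in chunks:
        groups[(chunk.entity_type, chunk.entity_id, chunk.grain)].append(chunk)
    return dict(groups)


def select_evidence(chunks, entity_type, entity_id, grain):
    """Keep only chunks about one entity at one grain."""
    wanted = (entity_type, entity_id, grain)
    return [c for c in chunks if (c.entity_type, c.entity_id, c.grain) == wanted]
```

With tags like these, two contracts with near-identical renewal language remain distinguishable even when their embeddings are close.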
Finally, carry permissions and freshness. The retrieval layer should know whether context is allowed for this user and whether it is current enough for this task. If that sounds like data infrastructure, good. That is the point.
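As a rough sketch of carrying permissions and freshness, candidates can be filtered at query time before they are ranked. The `Candidate` fields, the role-based permission model, and the `max_age` window are assumptions made for illustration. A real system may use a different access model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Candidate:
    text: str
    allowed_roles: frozenset  # roles permitted to read this context (assumed model)
    updated_at: datetime      # when the source was last confirmed current
    score: float              # similarity score returned by the vector index


def filter_candidates(candidates, user_roles, now, max_age):
    """Drop context the user may not see or that is too old, then rank the rest."""
    roles = frozenset(user_roles)
    kept = [
        c for c in candidates
        if c.allowed_roles & roles          # permission check at query time
        and now - c.updated_at <= max_age   # freshness window for this task
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)
```

Filtering before ranking means a restricted or stale chunk cannot reach the answer regardless of how similar it is to the query.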
What breaks first
- Chunks from different entities look similar and get blended into one answer.
- Old policies, contracts, or runbooks outrank current ones because the index lacks freshness signals.
- Retrieval returns restricted context because permissions were fixed at indexing time instead of being checked at access time.
- Evaluation examples test answer tone but not entity correctness, source validity, or allowed use.
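These failure modes can be turned into explicit evaluation checks. The following is a minimal sketch, assuming each retrieved item already carries an entity identifier, a currency flag, and a permission flag. The class names, field names, and failure labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    source_id: str
    entity_id: str
    is_current: bool        # assumed to be set by a freshness check upstream
    allowed_for_user: bool  # assumed to be set by a permission check upstream


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_entity_id: str


def check_case(case, retrieved):
    """Return a list of failure labels; an empty list means the case passes."""
    if not retrieved:
        return ["no_evidence"]
    failures = []
    for r in retrieved:
        if r.entity_id != case.expected_entity_id:
            failures.append(f"wrong_entity:{r.source_id}")
        if not r.is_current:
            failures.append(f"stale_source:{r.source_id}")
        if not r.allowed_for_user:
            failures.append(f"not_allowed:{r.source_id}")
    return failures
```

A case like this fails when the retrieved evidence is about the wrong entity, out of date, or off limits, even if the generated answer reads well.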
Questions to ask
Use these questions before tuning another retrieval parameter.
- Which entity and grain should each answer use?
- Which relationships help disambiguate similar records?
- Which permissions apply before retrieval and after retrieval?
- How does freshness affect ranking or exclusion?
- Which evaluation cases prove structured retrieval is working?
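For the freshness question, one possible answer is to decay the similarity score with age rather than exclude old sources outright. The sketch below uses an exponential half-life. The half-life approach and the function name are illustrative assumptions, not something this article prescribes.

```python
def freshness_weighted_score(similarity, age_days, half_life_days):
    """Halve the similarity score for every half_life_days of source age."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    return similarity * 0.5 ** (age_days / half_life_days)
```

With a 90-day half-life, a year-old runbook with similarity 0.9 scores lower than a ten-day-old one with similarity 0.7. Hard exclusion still fits sources that must never be served once superseded.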
For related context architecture, read AI-Ready Context, Context Graph vs Knowledge Graph, and Retrieval Governance in ODI.
Sources to start with
Use evaluation guidance and provenance standards to make retrieval evidence testable.