Open Data Infrastructure
Text-to-SQL on the Open Lakehouse
Text-to-SQL works when your table contracts, metadata, and permissions are machine-readable. Otherwise it turns into a prompt lottery.
Most text-to-SQL demos succeed because the schema is tiny, the data is clean, and nobody cares about governance. Real lakehouses are the opposite.
Why text-to-SQL fails in production
Text-to-SQL fails when the model is asked to guess. Guess the right table, guess which joins are valid, guess which columns are sensitive, and guess what "active customer" means in your business.
If you have to rely on hints and hand-rolled prompt logic to keep the model on the rails, you are doing governance in the least reliable place in your stack.
Core idea: text-to-SQL is a metadata quality test disguised as an AI feature.
What the model actually needs
To produce correct SQL, the model needs more than a schema dump.
- Business semantics: definitions, ownership, and the meaning of key metrics.
- Join intent: which joins are valid, and which are technically possible but logically wrong.
- Freshness and lineage: what data is current, how it was produced, and what it depends on.
- Permissions: what the user is allowed to query, at row, column, and dataset scope.
That is not prompt engineering work. That is data infrastructure work.
A lakehouse architecture that makes it work
On an open lakehouse, the best text-to-SQL architecture starts with an open contract layer:
- Open tables: Iceberg tables as the stable storage contract.
- Open catalogs: a governed catalog surface that exposes table metadata, policy, and lineage consistently.
- SQL portability: a parsing and transpilation layer that can normalize dialect differences when engines vary.
This is why ODI matters. If text-to-SQL only works inside one proprietary warehouse, you have built a feature that cannot move with you.
Guardrails that are not prompt tricks
The safest guardrails are structural.
- Constrain the schema: expose only the tables and columns the user can actually query.
- Prefer views and data products: give the model curated query surfaces instead of raw lakehouse sprawl.
- Audit every query: record the natural language request, generated SQL, and result metadata for review.
- Fail closed: if the model cannot prove the query is safe, block it.
If you cannot enforce these guardrails at the data boundary, you are trusting the model to do access control. That is the wrong abstraction.
Sources to start with
Start with open table and catalog specifications, then add the portability and governance layers.