Open Data Infrastructure
DataFusion UDF Boundaries for Governed Query Services
How DataFusion UDF boundaries shape review, sandboxing, lineage, and policy checks in governed query services.
User-defined functions look like a convenience until the query service becomes a control point. Then every custom function becomes part of governance.
Custom logic is a policy surface
Apache DataFusion supports user-defined scalar, window, aggregate, and table functions. That makes it a strong foundation for custom query services, embedded analytics, and data product APIs. It also means custom code can sit inside the same path that evaluates filters, projections, joins, and policy decisions.
A governed query service cannot treat UDFs as harmless extensions. A function can expose data the caller should not see, hide expensive computation, change what a value means, or produce output that lineage systems cannot trace.
UDFs need an operating contract
The operating contract should name the function owner, input types, output type, determinism expectations, side-effect rules, data sensitivity rules, version, review status, and lineage behavior. If the function changes business meaning, it belongs near semantic contract review. If it changes access behavior, it belongs near policy review.
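One way to make such a contract concrete is a small, immutable record stored next to the function registration. The sketch below is illustrative only: the class name `UdfContract`, its field names, the sensitivity labels, and the `mask_email` example are all assumptions made for this example, not part of DataFusion.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    RETIRED = "retired"


@dataclass(frozen=True)
class UdfContract:
    """Hypothetical operating contract for one version of one UDF."""
    name: str
    version: str
    owner: str
    input_types: tuple[str, ...]
    output_type: str
    deterministic: bool
    side_effects_allowed: bool
    max_input_sensitivity: str      # assumed labels, e.g. "public", "internal", "pii"
    review_status: ReviewStatus
    lineage_note: str               # how the transformation should appear in lineage
    changes_business_meaning: bool = False
    changes_access_behavior: bool = False

    def required_reviews(self) -> set[str]:
        """Route the function to the review boards its behavior implies."""
        reviews = set()
        if self.changes_business_meaning:
            reviews.add("semantic-contract")
        if self.changes_access_behavior:
            reviews.add("policy")
        return reviews


# Example entry; every value here is made up for illustration.
mask_email = UdfContract(
    name="mask_email",
    version="1.2.0",
    owner="data-platform",
    input_types=("Utf8",),
    output_type="Utf8",
    deterministic=True,
    side_effects_allowed=False,
    max_input_sensitivity="pii",
    review_status=ReviewStatus.APPROVED,
    lineage_note="masks the local part of an email address",
    changes_access_behavior=True,
)
```

Freezing the record means a change in behavior has to be published as a new version rather than an in-place edit, which keeps the review trail honest.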
The service should also separate allowed functions from experimental ones. DataFusion provides the execution framework; the platform has to decide which functions are safe to expose to which users and agents.
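The split between allowed and experimental functions can be sketched as a deny-by-default lookup that runs before a function is ever made visible to a session. The tier names, role names, and function names below are hypothetical, and the check is a platform-side stand-in rather than a DataFusion feature.

```python
from enum import Enum


class Tier(Enum):
    ALLOWED = "allowed"
    EXPERIMENTAL = "experimental"


# Hypothetical registry: which tier each registered function belongs to.
FUNCTION_TIERS = {
    "mask_email": Tier.ALLOWED,
    "fuzzy_match_v2": Tier.EXPERIMENTAL,
}

# Hypothetical mapping from caller role to the tiers it may use.
CALLER_TIERS = {
    "analyst": {Tier.ALLOWED},
    "agent": {Tier.ALLOWED},
    "platform-engineer": {Tier.ALLOWED, Tier.EXPERIMENTAL},
}


def may_call(caller_role: str, function_name: str) -> bool:
    """Return True only if the function is registered and its tier is open to the role."""
    tier = FUNCTION_TIERS.get(function_name)
    if tier is None:
        return False  # unregistered functions are denied by default
    return tier in CALLER_TIERS.get(caller_role, set())
```

A service built this way would register with the engine only the functions for which `may_call` returns true for the current caller, so an experimental function never appears in an agent's session at all.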
Core idea: In a governed query service, a UDF is not only code. It is a contract about meaning, access, and evidence.
The ODI query service pattern
Open Data Infrastructure needs query services that can move across engines without losing controls. DataFusion logical plans can help expose query shape before execution. UDF metadata should travel with that plan so policy systems and lineage systems know when custom logic changes the answer.
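To illustrate what "UDF metadata travels with the plan" could mean, the sketch below walks a toy plan tree and collects every custom function it references before anything executes. `PlanNode` is a deliberately simplified stand-in invented for this example; it is not DataFusion's `LogicalPlan` type, and a real service would walk the engine's own plan and expression trees instead.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Toy plan node: an operator, the UDFs its expressions call, and its inputs."""
    op: str
    udf_calls: list = field(default_factory=list)   # list of (name, version) tuples
    inputs: list = field(default_factory=list)      # child PlanNode objects


def collect_udfs(node: PlanNode) -> set:
    """Recursively gather (name, version) for every UDF referenced in the plan."""
    found = set(node.udf_calls)
    for child in node.inputs:
        found |= collect_udfs(child)
    return found


# Hypothetical plan: Projection(mask_email) over Filter(is_active_customer) over a scan.
plan = PlanNode(
    "Projection",
    udf_calls=[("mask_email", "1.2.0")],
    inputs=[
        PlanNode(
            "Filter",
            udf_calls=[("is_active_customer", "3.0.1")],
            inputs=[PlanNode("TableScan")],
        )
    ],
)

udfs_in_plan = collect_udfs(plan)
```

The resulting set is the piece of evidence a policy or lineage system can inspect before execution: which custom functions, at which versions, will shape the answer.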
For adjacent context, read DataFusion policy-aware query services, logical plans as policy evidence, and semantic contracts in ODI.
What breaks first
- A UDF receives columns that policy would have masked or filtered out in plain SQL.
- A function changes meaning while keeping the same name.
- Lineage records the table read but not the custom transformation applied.
- Agents call query tools that expose powerful functions without scope limits.
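The first failure in the list above can be guarded against with a pre-execution check that compares the columns each function call reads with the columns the caller is allowed to read. This is a minimal sketch under stated assumptions: the function name, the shape of `udf_args`, and the column names are hypothetical, and extracting the argument columns from a real plan is left out.

```python
def blocked_udf_inputs(udf_args: dict, readable_columns: set) -> dict:
    """Map each UDF to the argument columns the caller is not permitted to read.

    udf_args: function name -> list of column names passed as arguments.
    readable_columns: columns that policy allows this caller to read.
    An empty result means no function call reaches past policy.
    """
    violations = {}
    for name, columns in udf_args.items():
        denied = sorted(set(columns) - readable_columns)
        if denied:
            violations[name] = denied
    return violations


# Hypothetical query: two function calls, one of which touches a restricted column.
violations = blocked_udf_inputs(
    {"mask_email": ["email"], "risk_score": ["ssn", "postcode"]},
    readable_columns={"email", "postcode"},
)
```

A service could reject the query, or require an elevated review path, whenever the returned mapping is non-empty.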
Questions to ask
Ask how functions are registered, reviewed, versioned, and retired. Ask whether query plans expose custom function use before execution. Ask whether lineage and audit records show function names and versions.
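As a rough picture of an audit record that shows function names and versions, the sketch below builds a custom facet in the general shape OpenLineage uses, where each facet carries `_producer` and `_schemaURL` fields. The facet key `example_udfUsage`, the `functions` field, and both URLs are placeholders invented for this example; they are not defined by OpenLineage, and where the facet attaches (run, job, or dataset) is a design decision left open here.

```python
import json


def udf_usage_facet(udfs: set, producer: str, schema_url: str) -> dict:
    """Build a hypothetical custom facet listing the UDFs a query used."""
    return {
        "_producer": producer,       # identifies the system emitting the facet
        "_schemaURL": schema_url,    # points at the schema for this custom facet
        "functions": [
            {"name": name, "version": version}
            for name, version in sorted(udfs)   # sorted for stable, diffable output
        ],
    }


facet = udf_usage_facet(
    {("mask_email", "1.2.0"), ("is_active_customer", "3.0.1")},
    producer="https://example.com/query-service",                 # placeholder URL
    schema_url="https://example.com/schemas/UdfUsageFacet.json",  # placeholder URL
)

# The facet key is a made-up, prefixed name to avoid clashing with standard facets.
audit_record = {"facets": {"example_udfUsage": facet}}
print(json.dumps(audit_record, indent=2))
```

Recording the version alongside the name is what lets an auditor tell apart two runs that called a function with the same name but different behavior.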
Custom query logic deserves the same governance as the data it touches.
Sources to start with
These primary sources anchor the technical claims in this guide.
- DataFusion adding user-defined functions documentation
- DataFusion functions documentation
- DataFusion query builder documentation
- OpenLineage facets documentation
The function boundary is where code becomes data governance.