site/docs/udf-spec.md
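A SQL user-defined function (UDF or UDTF) is a callable routine that accepts input parameters and executes a function body. Depending on the function type, the result can be:

- Scalar function (UDF): a scalar value, which may be a primitive type (e.g., `int`, `string`) or a non-primitive type (e.g., `struct`, `list`).
- Table function (UDTF): a table of rows whose schema is described by a `struct` return type.

This specification introduces a standardized metadata format for UDFs in Iceberg.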
UDF metadata follows the same design principles as Iceberg table and view metadata: each function is represented by a self-contained metadata file. Metadata captures definitions, parameters, return types, documentation, security, properties, and engine-specific representations.
The UDF metadata file has the following fields:
| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | function-uuid | string | A UUID that identifies this UDF, generated once at creation. |
| required | format-version | int | UDF specification format version (must be 1). |
| required | definitions | list<definition> | List of function definition entities. |
| required | definition-log | list<definition-log> | History of versions within the function's definitions. |
| optional | location | string | The function's base location; used to create metadata file locations. |
| optional | properties | map<string,string> | A string-to-string map of properties. |
| optional | secure | boolean | Whether it is a secure function. Default: false. |
| optional | doc | string | Documentation string. |
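
As a rough illustration of the table above, the following Python sketch checks that each top-level field is present where required and has the expected JSON type. The helper name `validate_udf_metadata` and the wording of its messages are hypothetical and not part of this specification; a fuller validator would also descend into `definitions` and `definition-log`.

```python
# Hypothetical helper: illustrates the top-level field table, not a reference implementation.
REQUIRED_FIELDS = {
    "function-uuid": str,
    "format-version": int,
    "definitions": list,
    "definition-log": list,
}
OPTIONAL_FIELDS = {
    "location": str,
    "properties": dict,
    "secure": bool,
    "doc": str,
}


def _has_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it for int fields.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def validate_udf_metadata(metadata: dict) -> list:
    """Return a list of problems found in the top-level metadata fields."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in metadata:
            problems.append(f"missing required field: {name}")
        elif not _has_type(metadata[name], expected):
            problems.append(f"wrong type for {name}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in metadata and not _has_type(metadata[name], expected):
            problems.append(f"wrong type for {name}")
    version = metadata.get("format-version")
    if _has_type(version, int) and version != 1:
        problems.append("format-version must be 1")
    return problems
```

An empty list means the top-level shape looks plausible; it says nothing about whether the nested definitions are valid.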
Notes:

- When `secure` is set to `true`, engines should prevent leakage of sensitive information to end users. Each engine may have its own security definition and mechanisms. It is the administrator's responsibility to ensure that UDFs marked as secure are properly configured and protected in their environment.
- Entries in `properties` are treated as hints, not strict rules.

Each definition represents one function signature (e.g., `add_one(int)` vs `add_one(float)`). A definition is uniquely
identified by its signature (the ordered list of parameter types). There can be only one definition for a given signature.
All versions within a definition must accept the same signature as specified in the definition's parameters field and
must produce values of the declared return-type.
| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | definition-id | string | An identifier derived from canonical parameter-type tuple (see Definition ID). |
| required | parameters | list<parameter> | Ordered list of function parameters. Invocation order must match this list. |
| required | return-type | string or object | Declared return type using Types. |
| optional | return-nullable | boolean | A hint to indicate whether the return value is nullable or not. Default: true. |
| required | versions | list<definition-version> | Versioned implementations of this definition. |
| required | current-version-id | int | Identifier of the current version for this definition. |
| required | function-type | string ("udf" or "udtf") | If "udtf", return-type must be a struct (see Types) describing the output schema. |
| optional | doc | string | Documentation string. |
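
The following minimal sketch shows two consistency checks suggested by the table above: that a `udtf` declares a struct `return-type`, and that `current-version-id` refers to one of the listed versions. The second check is an assumption, since the text does not state it explicitly, and the function name `check_definition` is hypothetical.

```python
def check_definition(definition: dict) -> list:
    """Return a list of problems found in a single definition object."""
    problems = []

    # Assumption: the current version must be one of the listed versions.
    version_ids = [v["version-id"] for v in definition["versions"]]
    if definition["current-version-id"] not in version_ids:
        problems.append("current-version-id does not match any version")

    function_type = definition["function-type"]
    if function_type not in ("udf", "udtf"):
        problems.append("function-type must be 'udf' or 'udtf'")

    # A table function describes its output schema with a struct.
    if function_type == "udtf":
        return_type = definition["return-type"]
        if not (isinstance(return_type, dict) and return_type.get("type") == "struct"):
            problems.append("a udtf return-type must be a struct")
    return problems
```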

Each entry in `parameters` has the following fields:

| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | type | string | Parameter data type (see Types). |
| required | name | string | Parameter name. |
| optional | doc | string | Parameter documentation. |
Notes:
`<E> array_agg(E)`).

Types are based on the Iceberg Type.
Primitive and semi-structured type strings are encoded based on Iceberg Type JSON Representation
(e.g., int, string, timestamp, decimal(9,2), variant). Type strings must contain no spaces or quote characters.
Nested types (struct, list, map) use the Iceberg Type JSON Representation with the
following fields required. Any other fields must be ignored.

- `list` requires `type` and `element`, e.g., `{ "type": "list", "element": "string" }`
- `map` requires `type`, `key`, and `value`, e.g., `{ "type": "map", "key": "string", "value": "int" }`
- `struct` requires `type` and `fields`, where each field requires `name` and `type`, e.g., `{ "type": "struct", "fields": [ { "name": "id", "type": "int" }, { "name": "name", "type": "string" } ] }`

The `definition-id` is a canonical string derived from the parameter types, formatted as a comma-separated list with no
spaces. Each type uses the following string representation:

- Primitive and semi-structured types use their type string (e.g., `int`, `variant`)
- `list<element-type>` (e.g., `list<int>`)
- `map<key-type,value-type>` (e.g., `map<string,int>`)
- `struct<name1:type1,name2:type2,...>` with field names and types (e.g., `struct<id:int,name:string>`)

Examples of complete `definition-id` signatures:

- `int`: a single int parameter
- `int,string`: two parameters, an int and a string
- `int,list<int>,struct<id:int,name:string>`: three parameters, an int, a list, and a struct

Each definition can evolve over time by introducing new versions.
A definition version represents a specific implementation of that definition at a given point in time.
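
Returning briefly to the `definition-id` rules, the sketch below derives the canonical string from a `parameters` list whose types use the JSON forms described under Types. The function names `type_to_id` and `definition_id` are hypothetical; only the output format is taken from this specification.

```python
def type_to_id(t) -> str:
    """Render one parameter type in the definition-id string form."""
    if isinstance(t, str):
        # Primitive or semi-structured type string, e.g. "int" or "variant".
        return t
    kind = t["type"]
    if kind == "list":
        return f"list<{type_to_id(t['element'])}>"
    if kind == "map":
        return f"map<{type_to_id(t['key'])},{type_to_id(t['value'])}>"
    if kind == "struct":
        fields = ",".join(
            f"{field['name']}:{type_to_id(field['type'])}" for field in t["fields"]
        )
        return f"struct<{fields}>"
    raise ValueError(f"unsupported type: {kind!r}")


def definition_id(parameters: list) -> str:
    """Join the parameter types with commas and no spaces."""
    return ",".join(type_to_id(p["type"]) for p in parameters)


params = [
    {"name": "a", "type": "int"},
    {"name": "b", "type": {"type": "list", "element": "int"}},
    {"name": "c", "type": {"type": "struct", "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ]}},
]
print(definition_id(params))  # int,list<int>,struct<id:int,name:string>
```

Because parameter names are not part of the result, two definitions that differ only in parameter names would produce the same `definition-id`.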
| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | version-id | int | Monotonically increasing identifier of the definition version. |
| required | representations | list<representation> | UDF implementations. |
| optional | deterministic | boolean (default false) | Whether the function is deterministic. |
| optional | on-null-input | string ("return-null" or "call", default "call") | Defines how the UDF behaves when any input parameter is NULL. |
| required | timestamp-ms | long (unix epoch millis) | Creation timestamp of this version. |
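
To illustrate the two `on-null-input` values from the table above, here is a small Python sketch in which `None` stands in for SQL NULL. With `return-null` the body is skipped as soon as any argument is NULL; with `call` the body always runs and is responsible for handling NULLs itself. The `invoke` wrapper is hypothetical and only models the semantics; it is not how any particular engine evaluates functions.

```python
def invoke(body, on_null_input, *args):
    """Evaluate `body` on `args`, honoring the on-null-input setting."""
    if on_null_input == "return-null" and any(a is None for a in args):
        # The engine may skip evaluation entirely for this row.
        return None
    return body(*args)


def add(x, y):
    return x + y


def add_coalesced(x, y):
    # Handles NULLs internally, similar to COALESCE(x, 0) + COALESCE(y, 0).
    return (0 if x is None else x) + (0 if y is None else y)


print(invoke(add, "return-null", 1, None))      # None
print(invoke(add_coalesced, "call", 1, None))   # 1
```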

`on-null-input` provides an optimization hint for query engines:

- With `return-null`, the function always returns NULL if any input argument is NULL. This allows engines to apply predicate pushdown or skip function evaluation for rows with NULL inputs. For a function `f(x, y) = x + y`, the engine can safely rewrite `WHERE f(a,b) > 0` as `WHERE a IS NOT NULL AND b IS NOT NULL AND f(a,b) > 0`.
- With `call`, the function may handle NULLs internally (e.g., `COALESCE`, `NVL`, `IFNULL`), so the engine must execute the function even if some inputs are NULL.

Each representation is an object with at least one common field, `type`, that is one of the following:

- `sql`: a SQL expression that defines the function body

Representations further define metadata for each type.
A definition version can have multiple SQL representations of different dialects, but only one SQL representation per dialect. The SQL representation stores the function body as a SQL expression, with metadata such as the SQL dialect.
| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | type | string | Must be "sql" |
| required | dialect | string | SQL dialect identifier (e.g., "spark", "trino"). |
| required | sql | string | SQL expression text. |
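
A minimal sketch of how an engine might pick the SQL body for its own dialect from a definition version, while enforcing the rule of at most one SQL representation per dialect. The function name `select_sql` and the choice to return `None` when no dialect matches are assumptions; the specification does not prescribe a fallback.

```python
def select_sql(version: dict, dialect: str):
    """Return the SQL text for `dialect`, or None if the version has none."""
    matches = [
        rep for rep in version["representations"]
        if rep["type"] == "sql" and rep["dialect"] == dialect
    ]
    if len(matches) > 1:
        # Only one SQL representation per dialect is allowed.
        raise ValueError(f"multiple sql representations for dialect {dialect!r}")
    return matches[0]["sql"] if matches else None


version = {
    "version-id": 2,
    "representations": [
        {"type": "sql", "dialect": "trino", "sql": "x + 1"},
        {"type": "sql", "dialect": "spark", "sql": "x + 1"},
    ],
}
print(select_sql(version, "spark"))  # x + 1
```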
Notes:

- `sql` must reference parameters using the names declared in the definition's `parameters` field.

Each `definition-log` entry has the following fields:

| Requirement | Field name | Type | Description |
|---|---|---|---|
| required | timestamp-ms | long (unix epoch millis) | Timestamp when the function was updated to use the definition versions. |
| required | definition-versions | list<struct<definition-id:string,version-id:int>> | Mapping of each definition to its selected version at this time. |
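
One possible way to read the definition log is sketched below. It assumes that the entry with the latest `timestamp-ms` at or before a given instant describes the versions in effect at that instant; the text above does not spell out a lookup procedure, so treat this as an interpretation. The function name `versions_as_of` is hypothetical.

```python
def versions_as_of(definition_log: list, timestamp_ms: int) -> dict:
    """Return {definition-id: version-id} in effect at `timestamp_ms`."""
    eligible = [e for e in definition_log if e["timestamp-ms"] <= timestamp_ms]
    if not eligible:
        # The function did not exist yet at this instant.
        return {}
    latest = max(eligible, key=lambda e: e["timestamp-ms"])
    return {
        dv["definition-id"]: dv["version-id"]
        for dv in latest["definition-versions"]
    }


log = [
    {"timestamp-ms": 1734507000123,
     "definition-versions": [{"definition-id": "int", "version-id": 1}]},
    {"timestamp-ms": 1734507001123,
     "definition-versions": [{"definition-id": "int", "version-id": 1},
                             {"definition-id": "float", "version-id": 1}]},
    {"timestamp-ms": 1735507000124,
     "definition-versions": [{"definition-id": "int", "version-id": 2},
                             {"definition-id": "float", "version-id": 1}]},
]
print(versions_as_of(log, 1734507001123))  # {'int': 1, 'float': 1}
```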
Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:

SQL statement:

```sql
-- Trino SQL
CREATE FUNCTION add_one(x INT COMMENT 'Input integer')
COMMENT 'Add one to the input integer'
RETURNS INT
RETURN x + 1;

-- Trino SQL
CREATE FUNCTION add_one(x FLOAT COMMENT 'Input float')
COMMENT 'Add one to the input float'
RETURNS FLOAT
RETURN x + 1.0;

-- Spark SQL
CREATE OR REPLACE FUNCTION add_one(x FLOAT)
RETURNS FLOAT
RETURN x + 1.0;
```

```json
{
  "function-uuid": "42fd3f91-bc10-41c1-8a52-92b57dd0a9b2",
  "format-version": 1,
  "definitions": [
    {
      "definition-id": "int",
      "parameters": [
        {
          "name": "x", "type": "int", "doc": "Input integer"
        }
      ],
      "return-type": "int",
      "function-type": "udf",
      "doc": "Add one to the input integer",
      "versions": [
        {
          "version-id": 1,
          "deterministic": true,
          "representations": [
            { "type": "sql", "dialect": "trino", "sql": "x + 2" }
          ],
          "timestamp-ms": 1734507000123
        },
        {
          "version-id": 2,
          "deterministic": true,
          "representations": [
            { "type": "sql", "dialect": "trino", "sql": "x + 1" },
            { "type": "sql", "dialect": "spark", "sql": "x + 1" }
          ],
          "timestamp-ms": 1735507000124
        }
      ],
      "current-version-id": 2
    },
    {
      "definition-id": "float",
      "parameters": [
        {
          "name": "x", "type": "float", "doc": "Input float"
        }
      ],
      "return-type": "float",
      "function-type": "udf",
      "doc": "Add one to the input float",
      "versions": [
        {
          "version-id": 1,
          "deterministic": true,
          "representations": [
            { "type": "sql", "dialect": "trino", "sql": "x + 1.0" }
          ],
          "timestamp-ms": 1734507001123
        }
      ],
      "current-version-id": 1
    }
  ],
  "definition-log": [
    {
      "timestamp-ms": 1734507000123,
      "definition-versions": [
        { "definition-id": "int", "version-id": 1 }
      ]
    },
    {
      "timestamp-ms": 1734507001123,
      "definition-versions": [
        { "definition-id": "int", "version-id": 1 },
        { "definition-id": "float", "version-id": 1 }
      ]
    },
    {
      "timestamp-ms": 1735507000124,
      "definition-versions": [
        { "definition-id": "int", "version-id": 2 },
        { "definition-id": "float", "version-id": 1 }
      ]
    }
  ],
  "doc": "Overloaded scalar UDF for integer and float inputs",
  "secure": false
}
```

SQL statement:

```sql
CREATE FUNCTION fruits_by_color(c VARCHAR COMMENT 'Color of fruits')
COMMENT 'Return fruits of specific color from fruits table'
RETURNS TABLE (name VARCHAR, color VARCHAR)
RETURN SELECT name, color FROM fruits WHERE color = c;
```

```json
{
  "function-uuid": "8a7fa39a-6d8f-4a2f-9d8d-3f3a8f3c2a10",
  "format-version": 1,
  "definitions": [
    {
      "definition-id": "string",
      "parameters": [
        {
          "name": "c", "type": "string", "doc": "Color of fruits"
        }
      ],
      "return-type": {
        "type": "struct",
        "fields": [
          { "name": "name", "type": "string" },
          { "name": "color", "type": "string" }
        ]
      },
      "function-type": "udtf",
      "doc": "Return fruits of a specific color from the fruits table",
      "versions": [
        {
          "version-id": 1,
          "deterministic": true,
          "representations": [
            { "type": "sql", "dialect": "trino", "sql": "SELECT name, color FROM fruits WHERE color = c" },
            { "type": "sql", "dialect": "spark", "sql": "SELECT name, color FROM fruits WHERE color = c" }
          ],
          "timestamp-ms": 1734508000123
        }
      ],
      "current-version-id": 1
    }
  ],
  "definition-log": [
    {
      "timestamp-ms": 1734508000123,
      "definition-versions": [
        { "definition-id": "string", "version-id": 1 }
      ]
    }
  ],
  "doc": "UDTF returning (name, color) rows filtered by the given color",
  "secure": false
}
```