Updated on 2025-12-19 GMT+08:00

UDF Overview

When Fabric SQL's built-in functions fall short for a specific business scenario, you can write user-defined functions (UDFs) to encapsulate your business logic. For details about UDF capabilities, refer to UDFs.

The Fabric Data SDK supports registering and using several types of UDFs: Scalar UDF, Class UDF, Vectorized UDF, UDTF, and UDAF.

Table 1 UDF types

| Type | Input | Output | Description |
| --- | --- | --- | --- |
| Scalar UDF | Single row, multiple parameters, for example, (a, b, c) | Single row: a single scalar value (e.g., int, float, string, date) | The most common form. Processes one row of data per call. Used for simple calculations, formatting, cleansing, or transformations (e.g., string cleansing, math operations, date conversion). Analogous to a regular Python function. |
| Class UDF | Single row, same as Scalar UDF | Single row | An object-oriented programming (OOP) wrapper around a Scalar UDF, suited to scenarios that involve internal caching, initialization parameters, or model objects, for example loading a model, regex, or configuration once inside the function. The database initializes these resources only once instead of rebuilding them on every call. |
| Vectorized UDF | A batch of rows, passed as pandas.Series or pyarrow.ChunkedArray | A batch with the same number of rows, in a vectorized structure | The most efficient way to apply per-row transformations, because rows are processed in batches. Suitable for vectorized mathematical computations, batch text processing, embedding handling, and NLP tokenization. Benefits from SIMD and batch execution. Commonly used with Arrow and pandas. |
| User-defined table function (UDTF) | Single row | Multiple rows | A single input row may produce multiple output rows, similar to a Python generator that uses yield. Used for splitting, expanding, or unpacking structured data, for example splitting JSON arrays, expanding multimodal data, segmenting audio or video, tokenizing text, or exploding a collection into rows. |
| User-defined aggregate function (UDAF) | Multiple rows, possibly from different partitions | Single output | Aggregates a group of rows: summation, statistics, clustering, embedding aggregation. Provides lifecycle methods such as accumulate, merge, and finish, which support distributed execution. |
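
The input and output shapes in Table 1 can be illustrated with plain Python, without any SDK calls. The following is a minimal sketch, not Fabric Data SDK code: the names full_name, PhoneMasker, split_tags, and MeanAgg are hypothetical, and how such functions are registered is covered by the registration references below. Only the lifecycle method names accumulate, merge, and finish come from the table.

```python
import re
from typing import Iterator, Optional, Tuple


# Scalar shape: one row's arguments in, one scalar value out.
def full_name(first: str, last: str) -> str:
    return f"{first.strip()} {last.strip()}".title()


# Class shape: setup runs once in __init__, then __call__ handles each row.
class PhoneMasker:
    def __init__(self, keep_last: int = 4):
        self.keep_last = keep_last
        self.digits = re.compile(r"\d")  # compiled once, reused for every row

    def __call__(self, phone: str) -> str:
        head, tail = phone[:-self.keep_last], phone[-self.keep_last:]
        return self.digits.sub("*", head) + tail


# UDTF shape: one input row in, zero or more output rows out (via yield).
def split_tags(row_id: int, tags: str) -> Iterator[Tuple[int, str]]:
    for tag in tags.split(","):
        if tag.strip():
            yield (row_id, tag.strip())


# UDAF shape: per-partition state combined with accumulate / merge / finish.
class MeanAgg:
    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def accumulate(self, value: float) -> None:
        # Fold one input row into this partition's state.
        self.total += value
        self.count += 1

    def merge(self, other: "MeanAgg") -> None:
        # Combine the state computed on another partition.
        self.total += other.total
        self.count += other.count

    def finish(self) -> Optional[float]:
        # Produce the single aggregated output (None for an empty group).
        return self.total / self.count if self.count else None
```

In this sketch, the merge step is what would let each partition be aggregated independently before the partial states are combined into one result.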

UDF Registration Methods

There are two primary methods for registering a UDF: explicit registration and implicit registration.

Table 2 UDF registration methods

| Registration Method | Description | Depends on Session Object | Requires Intrusive Registration Logic | Use Case | Reference |
| --- | --- | --- | --- | --- | --- |
| Explicit registration | Code explicitly specifies the registration information of the UDF. | Yes | Yes | You want precise control over registration timing, can accept registration logic added to your code, or need to separate registration from usage of Scalar UDFs under the same backend connection. | Explicit registration syntax for UDFs |
| Implicit registration | Automatically discovers and registers UDFs at runtime. | No | No | You prefer non-intrusive registration of Scalar UDFs and do not need to separate registration from usage of Scalar UDFs under the same backend connection. | Implicit registration syntax for UDFs |
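
The difference between the two methods can be pictured with a toy registry. This is a conceptual sketch only: ToySession, register, and call are hypothetical names invented for illustration and are not the Fabric Data SDK API. For the actual syntax, see the references in Table 2.

```python
from typing import Callable, Dict, Optional, Union


class ToySession:
    """Hypothetical stand-in for a backend session (not the SDK API)."""

    def __init__(self) -> None:
        self.registry: Dict[str, Callable] = {}

    def register(self, fn: Callable, name: Optional[str] = None) -> str:
        # Explicit style: the caller decides when to register and under which name.
        key = name or fn.__name__
        self.registry[key] = fn
        return key

    def call(self, fn_or_name: Union[str, Callable], *args):
        # Implicit style: an unregistered callable is discovered and
        # registered automatically the first time it is used.
        if callable(fn_or_name):
            key = fn_or_name.__name__
            self.registry.setdefault(key, fn_or_name)
        else:
            key = fn_or_name
        return self.registry[key](*args)


def double(x: int) -> int:
    return 2 * x


def triple(x: int) -> int:
    return 3 * x


session = ToySession()

# Explicit: registration and usage are separate steps.
session.register(double, name="dbl")
explicit_result = session.call("dbl", 3)

# Implicit: no registration code is written; usage triggers registration.
implicit_result = session.call(triple, 3)
```

In the toy model, the explicit path lets one part of a program register a function and another part call it by name later, which is the kind of separation the Use Case column describes.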

UDF Registration Types

Whether you use explicit or implicit registration, what registration means depends on the UDF registration type, as described in the table below.

Table 3 UDF registration types

| UDF Registration Type | Description | Vectorized | Use Case and Feature | Reference |
| --- | --- | --- | --- | --- |
| Python | Registers a raw Python function or class into the database. | No | Processes data row by row. Suitable for simple or specific calculations, but with lower performance. | Python/PyArrow/Pandas UDF registration parameters |
| Builtin | Obtains the handle of an existing function in the database; no actual registration operation is performed. | No | Directly calls existing functions from the database backend. Suitable for using native database functions. | Builtin UDF registration parameters |
| PyArrow | Registers a Python function or class that accepts pyarrow.ChunkedArray as input and output with the database. | Yes | Uses PyArrow's high-performance computing capabilities. Suited to large datasets or workloads that require efficient computation. | Python/PyArrow/Pandas UDF registration parameters |
| Pandas | Registers a Python function or class that accepts pandas.Series as input and output with the database. | Yes | Uses pandas' vectorized operations. Suited to complex data processing at the Python level. | Python/PyArrow/Pandas UDF registration parameters |

Scalar UDFs currently support all four of the preceding registration types.

UDAFs and UDTFs support only the Python and Builtin registration types.
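
To make the row-by-row versus vectorized distinction concrete, the sketch below writes the same string-cleansing logic in both shapes. It uses only pandas and does not call the SDK. The function names clean_row and clean_batch are hypothetical. A PyArrow-type function would follow the same batch pattern with pyarrow.ChunkedArray in place of pandas.Series.

```python
import pandas as pd


# Python registration type shape: called once per row with scalar values.
def clean_row(value: str) -> str:
    return value.strip().lower()


# Pandas registration type shape: called once per batch, and returns
# a Series with the same number of rows as the input.
def clean_batch(values: pd.Series) -> pd.Series:
    return values.str.strip().str.lower()


batch = pd.Series([" Alpha", "BETA ", " Gamma "])

# Simulate row-by-row execution by applying the scalar function per element.
row_wise = batch.map(clean_row)

# Vectorized execution processes the whole batch in one call.
vectorized = clean_batch(batch)
```

Both paths produce the same values. The vectorized form replaces one Python function call per row with one call per batch, which is where the performance difference noted in Table 3 would come from.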

Additional Arguments During UDF Runtime

With both explicit and implicit registration, you can pass additional runtime arguments to a UDF using the with_arguments method. The supported arguments are concurrency, min_concurrency, max_concurrency, timeout, dpu, and apu. They provide fine-grained control over resource allocation, concurrency, and execution time limits for UDFs.

For detailed usage, refer to UDF WITH ARGUMENTS Syntax.