UDF Overview

When Fabric SQL's built-in functions do not cover a specific business scenario, you can write user-defined functions (UDFs) to encapsulate your own business logic. For details about UDF capabilities, refer to UDFs.

The Fabric Data SDK supports registering and using several types of UDFs: Scalar UDF, Class UDF, Vectorized UDF, UDTF, and UDAF.

**Table 1** UDF types
Type	Input	Output	Description
Scalar UDF	Single row, multiple parameters, for example, (a, b, c)	Single row output, a single scalar value (e.g., int, float, string, date)	The most common form, processing only one row of data per call. Used for simple calculations, formatting, cleansing, or transformations (e.g., string cleansing, math operations, date conversion). Analogous to regular functions in Python.
Class UDF	Same as Scalar UDF: single row input	Single row output	An object-oriented programming (OOP) encapsulation of Scalar UDFs, suitable for scenarios involving internal caching, initialization parameters, or model objects. For example, loading a model, regex, or configuration once inside the function. The database initializes resources internally only once, avoiding reconstruction with each call.
Vectorized UDF	A batch of rows, passed as pandas.Series or pyarrow.ChunkedArray	A batch with the same number of rows, in a vectorized structure	The most efficient way to apply a row-wise transformation, because the function processes a batch of rows per call instead of a single row. Suitable for vectorized mathematical computations, batch text processing, embedding handling, and NLP tokenization. Benefits from SIMD and batch execution; commonly used with Arrow and pandas.
User-defined table function (UDTF)	Single-row input	Multiple output rows	A single input row may produce multiple output rows, similar to a Python generator that uses yield. Used for splitting, expanding, or unpacking structured data. Examples: splitting JSON arrays, expanding multimodal data, audio/video segmentation, tokenization, and explode-style expansion.
User-defined aggregate function (UDAF)	Multiple input rows, possibly from different partitions	Single output	Aggregates a group of rows, for example summation, statistics, clustering, or embedding aggregation. Provides lifecycle methods such as accumulate, merge, and finish, which support distributed execution.
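
The input and output shapes in Table 1 can be pictured in plain Python. The following is a minimal sketch that does not use the Fabric Data SDK: every function and class name in it is made up for illustration, and the SDK's actual base classes, decorators, and signatures may differ. Only the lifecycle method names accumulate, merge, and finish come from the table.

```python
import re
from typing import Iterator, List


# Scalar UDF shape: one row in (several parameters), one scalar value out.
def full_name(first: str, last: str) -> str:
    return f"{first.strip()} {last.strip()}".title()


# Class UDF shape: expensive setup happens once in __init__,
# then every row reuses it through __call__.
class DigitExtractor:
    def __init__(self, pattern: str = r"\d+") -> None:
        self._regex = re.compile(pattern)  # compiled once, not per row

    def __call__(self, text: str) -> str:
        return "".join(self._regex.findall(text))


# UDTF shape: one input row may yield zero or more output rows.
def split_tags(tags: str) -> Iterator[str]:
    for tag in tags.split(","):
        tag = tag.strip()
        if tag:
            yield tag


# UDAF shape: accumulate() builds per-partition state, merge() combines
# partial states, and finish() returns the final value.
class MeanAggregate:
    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def accumulate(self, value: float) -> None:
        self.total += value
        self.count += 1

    def merge(self, other: "MeanAggregate") -> None:
        self.total += other.total
        self.count += other.count

    def finish(self) -> float:
        return self.total / self.count if self.count else float("nan")


def mean_over_partitions(partitions: List[List[float]]) -> float:
    """Simulate distributed execution: aggregate each partition, then merge."""
    result = MeanAggregate()
    for rows in partitions:
        partial = MeanAggregate()
        for value in rows:
            partial.accumulate(value)
        result.merge(partial)
    return result.finish()
```

In this sketch, merging partial states is what lets each partition be aggregated independently before the results are combined, which is the property the UDAF row refers to as supporting distributed execution.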

UDF Registration Methods

There are two primary methods for registering a UDF: explicit registration and implicit registration.

**Table 2** UDF registration methods
Registration Method	Description	Depends on Session Object	Requires Intrusive Registration Logic	Use Case	Reference
Explicit registration	The code explicitly specifies the UDF's registration information.	Yes	Yes	You want precise control over when registration happens, can accept registration logic being added to your code, or need to separate registration from usage of Scalar UDFs under the same backend connection.	Explicit registration syntax for UDFs
Implicit registration	UDFs are discovered and registered automatically at runtime.	No	No	You prefer non-intrusive registration of Scalar UDFs and do not need to separate registration from usage of Scalar UDFs under the same backend connection.	Implicit registration syntax for UDFs
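
The difference between the two methods in Table 2 can be illustrated with a toy model. The sketch below is purely conceptual: ToySession, its register and call methods, and the udf decorator are hypothetical names invented for this illustration and are not part of the Fabric Data SDK. For the real syntax, see the references listed in the table.

```python
from typing import Callable, Dict


class ToySession:
    """Hypothetical stand-in for a backend connection; not an SDK class."""

    def __init__(self) -> None:
        self.functions: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> Callable:
        # Explicit registration: the caller decides when this happens.
        self.functions[name] = fn
        return fn

    def call(self, name_or_fn, *args):
        if callable(name_or_fn):
            # Implicit registration: a marked function is discovered and
            # registered the first time it is used.
            fn = name_or_fn
            name = getattr(fn, "_udf_name", None)
            if name is None:
                raise TypeError("function is not marked as a UDF")
            self.functions.setdefault(name, fn)
        else:
            name = name_or_fn
        return self.functions[name](*args)


def udf(fn: Callable) -> Callable:
    """Hypothetical marker decorator; it needs no session object."""
    fn._udf_name = fn.__name__
    return fn


session = ToySession()


# Explicit: registration is a separate step that requires the session.
def add_one(x: int) -> int:
    return x + 1


session.register("add_one", add_one)
explicit_result = session.call("add_one", 1)


# Implicit: the function is only marked; registration happens on first use.
@udf
def double(x: int) -> int:
    return x * 2


implicit_result = session.call(double, 4)
```

In this toy model, the explicit path separates registration from usage and needs the session up front, while the implicit path defers registration until the function is first called.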

UDF Registration Types

With both explicit and implicit registration, what registration actually does depends on the UDF registration type. The following table describes each type.

**Table 3** UDF registration types
UDF Registration Type	Description	Vectorized	Use Case and Feature	Reference
Python	Registers a raw Python function or class into the database.	No	Processes data row by row, suitable for simple or specific calculations, but with lower performance.	Python/PyArrow/Pandas UDF registration parameters
Builtin	Obtains the handle of an existing function in the database; no actual registration operation is performed.	No	Directly calls existing functions from the database backend, suitable for using native database functions.	Builtin UDF registration parameters
PyArrow	Registers a Python function or class that accepts pyarrow.ChunkedArray as input and output with the database.	Yes	Uses PyArrow's high-performance computing capabilities, well suited to large datasets or workloads that require efficient computation.	Python/PyArrow/Pandas UDF registration parameters
Pandas	Registers a Python function or class that accepts pandas.Series as input and output with the database.	Yes	Uses pandas' vectorized operations, ideal for performing complex data processing at the Python level.	Python/PyArrow/Pandas UDF registration parameters

Scalar UDFs currently support all four of the preceding registration types.

UDAFs and UDTFs support only the Python and Builtin registration types.
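
To make the "Vectorized" column of Table 3 concrete, the sketch below contrasts a row-by-row function with a batch function over pandas.Series. It uses only pandas and shows the function shapes, not how they are registered; the function names are illustrative and no Fabric Data SDK call is involved.

```python
import pandas as pd


# Python registration type shape: called once per row with scalar values.
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9.0 / 5.0 + 32.0


# Pandas registration type shape: called once per batch; the input and
# output are pandas.Series with the same number of rows.
def celsius_to_fahrenheit_batch(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0


batch = pd.Series([0.0, 37.0, 100.0])

# Row by row: one Python-level call per value.
row_by_row = [celsius_to_fahrenheit(value) for value in batch]

# Vectorized: a single call handles the whole batch.
vectorized = celsius_to_fahrenheit_batch(batch)
```

Both variants produce the same values. The batch version moves the per-row loop out of Python, which is where the performance difference between the non-vectorized and vectorized registration types comes from.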

Additional Arguments During UDF Runtime

Both explicit and implicit registration let you pass additional runtime arguments to a UDF through the with_arguments method. The supported arguments are concurrency, min_concurrency, max_concurrency, timeout, dpu, and apu. They give fine-grained control over resource allocation, concurrency, and execution time limits for a UDF.

For detailed usage, refer to UDF WITH ARGUMENTS Syntax.
