UDF Overview
When Fabric SQL's built-in functions do not cover a specific business scenario, you can write custom user-defined functions (UDFs) to encapsulate your business logic. For the detailed capabilities of UDFs, refer to UDFs.
The Fabric Data SDK supports the registration and usage of various types of UDFs, including Scalar UDF, Class UDF, Vectorized UDF, UDTF, and UDAF.
| Type | Input | Output | Description |
|---|---|---|---|
| Scalar UDF | A single row with one or more parameters, for example, (a, b, c) | A single scalar value per row (e.g., int, float, string, date) | The most common form. Each call processes exactly one row. Used for simple calculations, formatting, cleansing, or transformations (e.g., string cleansing, math operations, date conversion). Analogous to a regular Python function. |
| Class UDF | Same as Scalar UDF: a single row | A single scalar value per row | An object-oriented (OOP) encapsulation of a Scalar UDF, suited to scenarios that need internal caching, initialization parameters, or model objects, for example, loading a model, regex, or configuration once. The database initializes these resources only once instead of rebuilding them on every call. |
| Vectorized UDF | A batch of rows, passed as pandas.Series or pyarrow.ChunkedArray | A batch with the same number of rows, in the same vectorized structure | The most efficient way to apply a per-row transformation, because rows are processed in batches. Suitable for vectorized mathematical computations, batch text processing, embedding handling, and NLP tokenization. Benefits from SIMD and batch execution, and is commonly used with Arrow and pandas. |
| User-defined table function (UDTF) | A single row | Multiple rows | A single input row may produce multiple output rows, similar to Python's yield. Used for splitting, expanding, or unpacking structured data. Examples: splitting JSON arrays, expanding multimodal data, audio/video segmentation, tokenization, explode-style expansion. |
| User-defined aggregate function (UDAF) | Multiple rows, possibly from different partitions | A single value | Aggregates a group of rows: summation, statistics, clustering, embedding aggregation. Provides lifecycle methods such as accumulate, merge, and finish, which support distributed execution. |
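The input and output shapes in the table above can be sketched in plain Python. This is a minimal illustration only: it does not call the Fabric Data SDK, every function and class name is hypothetical, and the exact signatures the SDK expects for each UDF type are not shown in this overview. Only the lifecycle method names accumulate, merge, and finish come from the table.

```python
import re
from typing import Iterator, List, Optional


def normalize_phone(raw: str) -> str:
    """Scalar UDF shape: one row in, one scalar value out."""
    return re.sub(r"\D", "", raw)


class KeywordMatcher:
    """Class UDF shape: setup runs once, then one call per row."""

    def __init__(self, keywords: List[str]) -> None:
        # Compile the pattern once instead of on every row.
        self._pattern = re.compile("|".join(map(re.escape, keywords)))

    def __call__(self, text: str) -> bool:
        return self._pattern.search(text) is not None


def split_tags(csv_tags: str) -> Iterator[str]:
    """UDTF shape: one input row may yield several output rows."""
    for tag in csv_tags.split(","):
        tag = tag.strip()
        if tag:
            yield tag


class MeanAggregator:
    """UDAF shape: accumulate rows, merge partial states, finish with one value."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def accumulate(self, value: float) -> None:
        self.total += value
        self.count += 1

    def merge(self, other: "MeanAggregator") -> None:
        # Combine the partial state computed on another partition.
        self.total += other.total
        self.count += other.count

    def finish(self) -> Optional[float]:
        return self.total / self.count if self.count else None


# Two partitions are aggregated independently and then merged.
left, right = MeanAggregator(), MeanAggregator()
left.accumulate(1.0)
left.accumulate(2.0)
right.accumulate(6.0)
left.merge(right)
print(left.finish())
```

The merge step is what lets an aggregate run on several partitions in parallel: each partition keeps only a small partial state (here a sum and a count), and the final value is computed after the states are combined.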
UDF Registration Methods
There are two primary methods for registering a UDF: explicit registration and implicit registration.
| Registration Method | Description | Depends on Session Object | Requires Intrusive Registration Logic | Use Case | Reference |
|---|---|---|---|---|---|
| Explicit registration | Your code explicitly specifies the UDF's registration information. | Yes | Yes | You need precise control over when registration happens, can accept registration logic in your code, or need to register a scalar UDF and use it separately under the same backend connection. | |
| Implicit registration | UDFs are discovered and registered automatically at runtime. | No | No | You prefer non-intrusive registration of scalar UDFs and do not need registration and usage to be separated under the same backend connection. | |
UDF Registration Types
Whether using explicit or implicit registration, the meaning of registration differs depending on the UDF type. Details are provided in the table below.
| UDF Registration Type | Description | Vectorized | Use Case and Features | Reference |
|---|---|---|---|---|
| Python | Registers a raw Python function or class with the database. | No | Processes data row by row. Suitable for simple or specialized calculations, but with lower performance. | |
| Builtin | Obtains a handle to a function that already exists in the database; no actual registration is performed. | No | Calls existing functions in the database backend directly. Suitable for using native database functions. | |
| PyArrow | Registers a Python function or class that accepts pyarrow.ChunkedArray as input and output with the database. | Yes | Uses PyArrow's high-performance computing capabilities. Well suited to large datasets or workloads that require efficient computation. | |
| Pandas | Registers a Python function or class that accepts pandas.Series as input and output with the database. | Yes | Uses pandas' vectorized operations. Well suited to complex data processing at the Python level. | |
Scalar UDFs currently support all four of the preceding registration types.
For UDAFs and UDTFs, only the Python and Builtin registration types are supported.
Additional Arguments During UDF Runtime
With both explicit and implicit registration, you can pass additional runtime arguments to a UDF through the with_arguments method. These arguments include concurrency, min_concurrency, max_concurrency, timeout, dpu, and apu. They enable fine-grained resource allocation, concurrency control, and execution time limits for UDFs.
For detailed usage, refer to UDF WITH ARGUMENTS Syntax.