User-Defined Function APIs
Explicit Registration Syntax for UDFs
Explicit registration means embedding the registration logic directly in Python code by calling methods such as backend.register or register_from_file. The call itself triggers registration. This approach requires a backend session object to be available before you register.
Explicit registration is recommended if you want to control exactly when registration happens, can accept adding registration logic to your code (an intrusive change), or need to separate UDF registration from UDF usage under the same backend connection.
A common scenario is one team registering UDFs while several other teams use them, with no Python scripts shared between the teams.
| UDF Type | UDF Type (Secondary) | Registration Type (Tertiary) | Code Entry | Reference |
|---|---|---|---|---|
| udf, udaf, udtf | python | Direct registration | backend.[udf \| udaf \| udtf].python.register(<Registration function>, <Registration parameters>) | |
| udf, udaf, udtf | python | File-based registration | backend.[udf \| udaf \| udtf].python.register_from_file(<File path>, <Function name>, <Registration parameters>) | |
| udf, udaf, udtf | builtin | Direct registration | backend.[udf \| udaf \| udtf].builtin.register(<Registration function>, <Registration parameters>) | |
| udf, udaf, udtf | builtin | File-based registration | backend.[udf \| udaf \| udtf].builtin.register_from_file(<File path>, <Function name>, <Registration parameters>) | |
| udf | pyarrow | Direct registration | backend.udf.pyarrow.register(<Registration function>, <Registration parameters>) | |
| udf | pyarrow | File-based registration | backend.udf.pyarrow.register_from_file(<File path>, <Function name>, <Registration parameters>) | |
| udf | pandas | Direct registration | backend.udf.pandas.register(<Registration function>, <Registration parameters>) | |
| udf | pandas | File-based registration | backend.udf.pandas.register_from_file(<File path>, <Function name>, <Registration parameters>) | |
Implicit Registration Syntax for UDFs
Implicit registration simplifies this process by letting the Python runtime detect and register UDFs automatically. Instead of embedding registration logic in your code, you decorate your Python functions with @ decorators. A decorated function can then be referenced in DataFrames by its original identifier, which completes the registration. You do not need a backend session object when applying the @ decorator. The backend session is required only later, when you work with Ibis DataFrames.
This approach is useful when you want to register UDFs without intrusive changes to your code and do not need to separate UDF registration from UDF usage under the same backend connection.
A common use case is a single Python script that both registers and applies UDFs in one workflow.
| UDF Type | UDF Type (Secondary) | Code Entry | Reference |
|---|---|---|---|
| udf, udaf, udtf | python | @fabric.[udf \| udaf \| udtf].python(<Registration parameters>) | |
| udf, udaf, udtf | builtin | @fabric.[udf \| udaf \| udtf].builtin(<Registration parameters>) | |
| udf | pyarrow | @fabric.udf.pyarrow(<Registration parameters>) | |
| udf | pandas | @fabric.udf.pandas(<Registration parameters>) | |
For implicit registration, the moment at which registration actually happens depends on the DataFrame operation mode (Lazy or Eager).
As described in the official Ibis documentation, DataFrame operations run in either Eager or Lazy mode, controlled by the ibis.options.interactive configuration item. It defaults to False, so all DataFrames operate in Lazy mode unless you change it. The timing of UDF registration in each mode is as follows:
| ibis.options.interactive | DataFrame Execution Mode | UDF Registration Time | UDF Usage Time |
|---|---|---|---|
| False | Lazy | When the entire DataFrame calls the execute method | When the entire DataFrame calls the execute method |
| True | Eager | First use in DataFrame | Every use in DataFrame |
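The timing difference in the table above can be pictured with a small, self-contained toy model. This is only a sketch: toy_udf, ToyColumn, REGISTERED, and EVENTS are hypothetical names invented for the illustration and are not part of fabric_data or Ibis. It assumes nothing about the real implementation beyond the timing rules stated in the table.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins; none of these names come from fabric_data or Ibis.
REGISTERED: Dict[str, Callable] = {}  # plays the role of the backend's UDF catalog
EVENTS: List[str] = []                # records when registration and usage happen


def toy_udf(fn: Callable) -> Callable:
    """Decorator that only marks a function; it registers nothing yet."""
    fn.is_toy_udf = True
    return fn


def _ensure_registered(fn: Callable) -> None:
    # Register on first need, then reuse the existing entry.
    if fn.__name__ not in REGISTERED:
        REGISTERED[fn.__name__] = fn
        EVENTS.append(f"register {fn.__name__}")


class ToyColumn:
    """A tiny expression: a list of values plus the UDF calls still to run."""

    def __init__(self, values, interactive: bool, pending=None):
        self.values = list(values)
        self.interactive = interactive
        self.pending = list(pending or [])

    def apply(self, fn: Callable) -> "ToyColumn":
        if self.interactive:
            # Eager mode: register on first use and evaluate right away.
            _ensure_registered(fn)
            EVENTS.append(f"use {fn.__name__}")
            return ToyColumn([fn(v) for v in self.values], True)
        # Lazy mode: only remember the call; nothing is registered or run.
        return ToyColumn(self.values, False, self.pending + [fn])

    def execute(self) -> list:
        # Lazy mode: registration and usage both happen here.
        values = self.values
        for fn in self.pending:
            _ensure_registered(fn)
            EVENTS.append(f"use {fn.__name__}")
            values = [fn(v) for v in values]
        return values


@toy_udf
def double(x: int) -> int:
    return 2 * x


@toy_udf
def triple(x: int) -> int:
    return 3 * x


lazy = ToyColumn([1, 2, 3], interactive=False).apply(double)
events_before_execute = list(EVENTS)  # nothing has happened yet in lazy mode
lazy_result = lazy.execute()          # "register double", then "use double"

eager = ToyColumn([1, 2, 3], interactive=True).apply(triple).apply(triple)
```

In the lazy branch the event list stays empty until execute is called. In the eager branch the first apply both registers and uses the function, and the second apply only uses it.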
Python/PyArrow/Pandas UDF Registration Parameters
Registering a Python, PyArrow, or Pandas UDF registers an original Python function or class in the database.
For explicit and implicit registration alike, and for scalar UDFs, UDAFs, and UDTFs alike, you can currently pass the following parameters when registering Python, PyArrow, or Pandas UDFs:
| Registration Parameter | Description | Type | Default Value |
|---|---|---|---|
| name | Specifies the actual storage name of the UDF in the database. | str \| None | None |
| database | Specifies the LakeFormation database where the UDF resides. | str \| None | None |
| fn | Specifies the original Python function of the UDF. | Callable | None |
| signature | Specifies the UDF function signature and return value type. | fabric_data.ibis.common.annotations.Signature \| None | None |
| replace (currently unavailable) | Specifies whether the UDF supports in-place modification. | bool | False |
| temporary (currently unavailable) | Specifies whether the UDF has a session-level lifecycle. | bool | False |
| if_not_exist (currently unavailable) | Specifies whether to skip errors for existing UDFs. | bool | False |
| strict | Specifies whether the UDF automatically filters NULL values. | bool | True |
| volatility | Specifies the stability of the UDF. | VolatilityType.VOLATILE \| VolatilityType.STABLE \| VolatilityType.IMMUTABLE | VolatilityType.VOLATILE |
| runtime_version (currently unavailable) | Specifies the Python version for executing the UDF. | str | sys.version_info |
| imports | Specifies the external code files on which the UDF depends. | List[str] | None |
| packages | Specifies the Python modules on which the UDF depends. | List[Union[str, module]] | None |
| register_type | Specifies the registration form of the UDF. | RegisterType.INLINE \| RegisterType.STAGED | RegisterType.INLINE |
| comment | Specifies user comments for the UDF. | str \| None | None |
Precautions for Python/PyArrow/Pandas UDF Registration Parameters
- For the imports parameter, only file paths in the same directory or subdirectories as the .py file containing the current Python function or class are allowed.
- For the fn parameter, if fn is not defined in the .py file where the UDF is being registered, the path of the file that defines fn must also be added to the imports parameter. For example:

      import ibis
      from process import outer

      con = ibis.fabric.connect(...)

      # Register a UDAF.
      udf = con.udaf.python.register(
          outer(),                 # fn introduced externally
          imports=["process.py"],  # Add the file path for fn
      )

- The signature parameter is currently optional. If you specify it, your input takes precedence over automatic inference. If you omit it, the system infers the parameter and return value types automatically. For details, see Type Inference of the signature Parameter.
- When registering a PyArrow UDF, the UDF always relies on PyarrowVector, whether or not you provide the signature parameter. For example:

      import ibis
      import fabric_data as fabric
      import pyarrow as pa
      import pyarrow.compute as pc

      # === With signature: Requires dependency on PyarrowVector. ===
      def calculate_sum(
          prices: pa.ChunkedArray,
          quantities: pa.ChunkedArray,
      ) -> pa.ChunkedArray:
          return pc.multiply(prices, quantities)

      con = ibis.fabric.connect(...)

      # Register a UDF.
      udf = con.udf.pyarrow.register(
          fn=calculate_sum,
          signature=fabric.Signature(
              parameters=[
                  fabric.Parameter(name="price", annotation=fabric.PyarrowVector[float]),
                  fabric.Parameter(name="quantity", annotation=fabric.PyarrowVector[int]),
              ],
              return_annotation=fabric.PyarrowVector[float],
          ),
      )

      # === Without signature: Also requires dependency on PyarrowVector. ===
      def calculate_sum(
          prices: fabric.PyarrowVector[float],
          quantities: fabric.PyarrowVector[int],
      ) -> fabric.PyarrowVector[float]:
          return fabric.PyarrowVector[float](pc.multiply(prices, quantities))

      con = ibis.fabric.connect(...)

      # Register a UDF.
      udf = con.udf.pyarrow.register(
          fn=calculate_sum
      )

- When registering a Pandas UDF, the UDF always relies on PandasVector, whether or not you provide the signature parameter. For example:

      import ibis
      import fabric_data as fabric
      import pandas as pd

      # === With signature: Requires dependency on PandasVector. ===
      def calculate_sum(
          prices: pd.Series,
          quantities: pd.Series,
      ) -> pd.Series:
          return pd.Series(prices * quantities, dtype=pd.Float64Dtype())

      con = ibis.fabric.connect(...)

      # Register a UDF.
      udf = con.udf.pandas.register(
          fn=calculate_sum,
          signature=fabric.Signature(
              parameters=[
                  fabric.Parameter(name="price", annotation=fabric.PandasVector[float]),
                  fabric.Parameter(name="quantity", annotation=fabric.PandasVector[int]),
              ],
              return_annotation=fabric.PandasVector[float],
          ),
      )

      # === Without signature: Also requires dependency on PandasVector. ===
      def calculate_sum(
          prices: fabric.PandasVector[float],
          quantities: fabric.PandasVector[int],
      ) -> fabric.PandasVector[float]:
          return fabric.PandasVector[float](prices * quantities, dtype=pd.Float64Dtype())

      con = ibis.fabric.connect(...)

      # Register a UDF.
      udf = con.udf.pandas.register(
          fn=calculate_sum
      )
- For the volatility parameter, the meanings of the three enumeration types are:
  - VolatilityType.VOLATILE: Function results may change at any time.
  - VolatilityType.STABLE: For fixed inputs, the function's result does not change during a single scan.
  - VolatilityType.IMMUTABLE: The function always produces the same result for identical inputs.

  The volatility parameter does not impact the execution of function pushdown. Python UDFs can be pushed down to DNs regardless of whether they are classified as IMMUTABLE, STABLE, or VOLATILE.
- If you do not specify the packages parameter:
  - For PyArrow UDFs, the PyArrow version installed in the backend environment is automatically used.
  - For Pandas UDFs, the Pandas version installed in the backend environment is automatically used.
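To make the three volatility categories listed above more concrete, the following plain-Python sketch shows one function of each flavor. The function names are invented for the illustration, and the sketch does not call fabric_data; it only shows the kind of behavior each VolatilityType value describes, under the assumption that "a single scan" can be modeled as one fixed value captured up front.

```python
import itertools


def add_tax(price: float) -> float:
    """IMMUTABLE-style: the result depends only on the argument."""
    return round(price * 1.2, 2)


def make_converter(rate_for_this_scan: float):
    """STABLE-style: the rate is fixed for one scan but may differ between scans."""
    def convert(amount: float) -> float:
        return round(amount * rate_for_this_scan, 2)
    return convert


_ticket_numbers = itertools.count(1)


def next_ticket(prefix: str) -> str:
    """VOLATILE-style: the same argument gives a different result on each call."""
    return f"{prefix}-{next(_ticket_numbers)}"


immutable_pair = (add_tax(10.0), add_tax(10.0))   # always identical
scan_1 = make_converter(2.0)
scan_2 = make_converter(3.0)
stable_within_scan = (scan_1(5.0), scan_1(5.0))   # identical inside one scan
stable_across_scans = (scan_1(5.0), scan_2(5.0))  # may differ between scans
volatile_pair = (next_ticket("t"), next_ticket("t"))  # differs on every call
```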
Builtin UDF Registration Parameters
For Builtin UDFs, registration only obtains a handle to a function that already exists in the database. No actual registration occurs.
For explicit and implicit registration alike, and for scalar UDFs, UDAFs, and UDTFs alike, you can currently pass the following parameters when registering Builtin UDFs:
| Registration Parameter | Description | Type | Default Value |
|---|---|---|---|
| name | Specifies the actual storage name of the UDF in the database. | str \| None | None |
| database | Specifies the LakeFormation database where the UDF resides. | str \| None | None |
| fn | Specifies the original Python function of the UDF. | Callable | None |
| signature | Specifies the UDF function signature and return value type. | ibis.common.annotations.Signature \| None | None |
Precautions for Builtin UDF Registration Parameters
The signature parameter is currently optional. If you specify it, your input takes precedence over automatic inference. If you omit it, the system infers the parameter and return value types automatically. For details, see Type Inference of the signature Parameter.
Type Inference of the signature Parameter
For the signature parameter, you can either supply the parameter and return value types or omit them entirely.
- If you supply the signature parameter, the original Python function does not need to use type hinting syntax, and the UDF can be registered in immediate (REPL) operation.
- If you omit the signature parameter, you are advised to use type hinting syntax in the original Python function. In this case the UDF cannot be registered in immediate (REPL) operation.
A comparison of these approaches is summarized below.
| signature Parameter | Description | Requires Type Hints in the Original Python Function | Supports Immediate REPL Operation |
|---|---|---|---|
| Omitted | Auto-deduction (recommended) | No, but recommended | No |
| Specified | Specified value | No | Yes |
Here, immediate operation refers to the read-eval-print loop (REPL), as found in Python's interactive terminal.
Type hinting syntax was introduced in Python 3.5 by PEP 484. A parameter's type is written after the parameter name, separated by a colon (:), and the return type is written after the parameter list, following an arrow (->). For example:
    def greet(name: str) -> str:
        return f"Hello, {name}"

    from typing import List, Dict, Optional

    def process_data(data: List[int]) -> Dict[str, Optional[int]]:
        return {"max": max(data) if data else None}
Python, PyArrow, and Pandas UDFs are strictly typed at registration: every parameter type and the return value type must be specified explicitly. If you do not define them through type hints on the original Python function, you must use the signature parameter to specify the Ibis DataType.
In contrast, Builtin UDFs are not strictly typed at registration, because the UDF already exists in the database. If you cannot provide type annotations for the original Python function, you are advised to write only the parameter names without their types. If you later use the return value of the Builtin UDF (that is, the UDF is not a Top SELECT UDF), the return type must be specified, and where necessary you should use the signature parameter to define the Ibis DataType. If the return value is not used (Top SELECT UDF), you may omit the return type.
When you do not provide the signature parameter and rely on auto-deduction, the requirements are as follows:
| Registered UDF Type | Parameter Type | Return Type |
|---|---|---|
| Python/PyArrow/Pandas UDF | Requires type hinting syntax for specification. | Requires type hinting syntax for specification. |
| Builtin UDF | Allows writing just parameter names without types. | Requires type hinting syntax when utilizing return values subsequently. Otherwise, not mandatory. |
When you do not pass the signature parameter, automatic inference is based on inspect.signature. The system currently accepts the following parameter and return value types:
| Python | Ibis DataType | DataArts Fabric SQL |
|---|---|---|
| DataType | DataType | - |
| type(None) | null | NULL |
| bool | Boolean | BOOLEAN |
| bytes | Binary | BYTEA |
| str | String | TEXT |
| numbers.Integral | Int64 | BIGINT |
| numbers.Real | Float64 | DOUBLE PRECISION |
| decimal.Decimal | Decimal | DECIMAL |
| datetime.datetime | Timestamp | TIMESTAMP/TIMESTAMPTZ |
| datetime.date | Date | TIMESTAMP |
| datetime.time | Time | TIME |
| datetime.timedelta | Interval | INTERVAL |
| uuid.UUID | UUID | UUID |
| class | Struct | STRUCT |
| typing.Sequence, typing.Array | Array | ARRAY |
| typing.Mapping, typing.Map | Map | HSTORE |
| fabric_data.PyarrowVector[T] | T | T |
| fabric_data.PandasVector[T] | T | T |
Notes:
- Python's built-in int type counts as a subclass of numbers.Integral.
- Python's built-in float type counts as a subclass of numbers.Real.
Python types that are not listed in the preceding table are not currently supported by automatic conversion.
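As a rough illustration of how inference based on inspect.signature could read type hints and look them up in the mapping above, here is a minimal standard-library sketch. TYPE_TABLE, lookup, and infer are hypothetical names written for this example, the table covers only the scalar rows, and the results are plain strings rather than real Ibis DataType objects. It is not the library's implementation.

```python
import datetime
import decimal
import inspect
import numbers
import uuid

# Scalar rows of the mapping table: (Python type, Ibis DataType name).
# Order matters: bool is a subclass of int, and datetime.datetime is a
# subclass of datetime.date, so the more specific type is listed first.
TYPE_TABLE = [
    (type(None), "null"),
    (bool, "Boolean"),
    (bytes, "Binary"),
    (str, "String"),
    (numbers.Integral, "Int64"),
    (numbers.Real, "Float64"),
    (decimal.Decimal, "Decimal"),
    (datetime.datetime, "Timestamp"),
    (datetime.date, "Date"),
    (datetime.time, "Time"),
    (datetime.timedelta, "Interval"),
    (uuid.UUID, "UUID"),
]


def lookup(annotation) -> str:
    """Return the Ibis DataType name for one annotation, per the table."""
    if annotation is inspect.Parameter.empty:
        raise TypeError("missing type hint")
    if annotation is None:  # "-> None" is stored as None, not type(None)
        annotation = type(None)
    for py_type, ibis_name in TYPE_TABLE:
        if isinstance(annotation, type) and issubclass(annotation, py_type):
            return ibis_name
    raise TypeError(f"unsupported annotation: {annotation!r}")


def infer(fn):
    """Collect parameter and return types from the function's type hints."""
    sig = inspect.signature(fn)
    params = {name: lookup(p.annotation) for name, p in sig.parameters.items()}
    return params, lookup(sig.return_annotation)


def total(price: float, quantity: int, label: str) -> float:
    return price * quantity


inferred = infer(total)
```

Because int and float are matched through numbers.Integral and numbers.Real, the sketch reproduces the two notes above: int resolves to Int64 and float to Float64.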
For parameters and return values for which you neither pass the signature parameter nor use Python type hints, automatic inference currently works as follows:
| Parameter Type | Generated Matching Pattern | Pattern Effectiveness |
|---|---|---|
| POSITIONAL_ONLY, KEYWORD_ONLY, POSITIONAL_OR_KEYWORD | ValueOf(None) | Exempts from __signature__.validate. |
| VAR_POSITIONAL | TupleOf(pattern=pattern) | Executes pattern in a for-loop. |
| VAR_KEYWORD | DictOf(key_pattern=InstanceOf(str), value_pattern=pattern) | Executes pattern in a for-loop. |
| Return | ValueOf(Unknown) | Provides UnknownScalar, UnknownColumn as UDF return values passed upward. |
The classification of parameter types (Parameter.kind) by inspect.signature is as follows:
| Parameter Type | Description | Example Code | Parameters Meeting Conditions |
|---|---|---|---|
| POSITIONAL_ONLY | Positional-only parameter. | def func(a, /, b): pass | a |
| KEYWORD_ONLY | Keyword-only parameter. | def func(a, *, b): pass | b |
| POSITIONAL_OR_KEYWORD | Positional or keyword parameter. | def func(a, b): pass | a, b |
| VAR_POSITIONAL | Variable positional parameter. | def func(*args): pass | args |
| VAR_KEYWORD | Variable keyword parameter. | def func(**kwargs): pass | kwargs |
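The classification in the table can be checked directly with the standard library. The short sketch below uses one illustrative function (func and its parameter names are arbitrary) that contains all five kinds, and also lists which parameters carry no type hint, since those are the ones the fallback patterns above would apply to.

```python
import inspect


def func(a, /, b, *args, c, **kwargs):
    """One function that contains all five parameter kinds."""


parameters = inspect.signature(func).parameters

# Map each parameter name to the name of its Parameter.kind.
kinds = {name: p.kind.name for name, p in parameters.items()}

# Parameters without a type hint have annotation == inspect.Parameter.empty.
unannotated = [
    name for name, p in parameters.items()
    if p.annotation is inspect.Parameter.empty
]
```

Here a is POSITIONAL_ONLY, b is POSITIONAL_OR_KEYWORD, args is VAR_POSITIONAL, c is KEYWORD_ONLY, and kwargs is VAR_KEYWORD. The / marker requires Python 3.8 or later.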
Direct Operation Syntax for UDFs
For scenarios where registration and usage are separate, a direct operation syntax is provided for scalar UDFs, UDAFs, and UDTFs. To use a UDF directly, you only need its name (name) and the name of the database (database) where it resides. The following operations are accessed through the UDF attribute of the backend session object.
signature(name, database=None)
Description: Returns the function signature and return value type of the UDF from the backend database.
Input parameters:
- name (str): UDF name.
- database (str): Name of the database the UDF belongs to.
Return type: fabric_data.ibis.common.annotations.Signature - Registered UDF's signature and return type.
get(name, database=None)
Description: Returns the UDF from the backend database.
Input parameters:
- name (str): UDF name.
- database (str): Name of the database the UDF belongs to.
Return type: Callable[..., ibis.expr.types.Value] - The registered UDF.
names(database=None)
Description: Returns the names of all UDFs from the backend database.
Input parameters:
- database (str): Name of the database the UDF belongs to.
Return type: List[str] - Names of all registered UDFs.
unregister(name, database=None)
Description: Deletes a specified UDF from the backend database.
Input parameters:
- name (str): UDF name.
- database (str): Name of the database the UDF belongs to.
Return type: None.
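The shape of these four operations, and the way they let one team register while another team only looks functions up by name, can be mimicked with a small in-memory stand-in. ToyUdfCatalog and its register method are hypothetical and exist only for this sketch. The real operations talk to the backend database and return the types listed above, whereas this stand-in returns a plain inspect.Signature and the original Python callable.

```python
import inspect
from typing import Callable, Dict, List, Optional, Tuple


class ToyUdfCatalog:
    """Hypothetical in-memory stand-in for a backend's UDF catalog."""

    def __init__(self) -> None:
        # Keyed by (database, name), mirroring the two lookup arguments.
        self._store: Dict[Tuple[Optional[str], str], Callable] = {}

    def register(self, fn: Callable, name: Optional[str] = None,
                 database: Optional[str] = None) -> Callable:
        self._store[(database, name or fn.__name__)] = fn
        return fn

    def signature(self, name: str, database: Optional[str] = None) -> inspect.Signature:
        return inspect.signature(self._store[(database, name)])

    def get(self, name: str, database: Optional[str] = None) -> Callable:
        return self._store[(database, name)]

    def names(self, database: Optional[str] = None) -> List[str]:
        return sorted(n for db, n in self._store if db == database)

    def unregister(self, name: str, database: Optional[str] = None) -> None:
        del self._store[(database, name)]


catalog = ToyUdfCatalog()


# "Team A" registers the function once.
def add_one(x: int) -> int:
    return x + 1


catalog.register(add_one, name="add_one", database="sales")

# "Team B" knows only the UDF name and the database name.
listed = catalog.names(database="sales")
sig_text = str(catalog.signature("add_one", database="sales"))
result = catalog.get("add_one", database="sales")(41)

catalog.unregister("add_one", database="sales")
remaining = catalog.names(database="sales")
```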
UDF WITH ARGUMENTS Syntax
Whether you obtain a UDF operator through explicit registration syntax or implicit registration syntax, or access a registered UDF through direct operation syntax, you can currently pass arguments to it with the with_arguments method. These arguments fall into two categories:
- Special-purpose parameters: These parameters configure runtime resources, concurrency, and execution time limits for the UDF. Examples include concurrency, timeout, dpu, and apu. For details, see the UDF runtime configuration list. All UDF types currently accept these configuration parameters through with_arguments.
- General parameters: These parameters initialize the state of the UDF during its setup phase. They can be passed, as scalar values only, if you have defined optional parameters in the __init__ method of a class UDF, class UDTF, or UDAF. The internal state is initialized once and can then be reused many times.
| Parameter | Applicable UDF Type | Description |
|---|---|---|
| Special parameters (e.g., concurrency, timeout, dpu, apu) | All UDF types, including scalar UDF, class UDF, vectorized UDF, UDTF, UDAF | Configures runtime resources, concurrency, and execution time limits for the UDF. |
| General parameters defined by the Python Class's __init__ method | Class UDF types, including class UDF, class UDTF, UDAF | Sets up initial state for the UDF, suitable for cases involving internal caching, initialization parameters, or model objects. |
All parameter values passed through the with_arguments method are scalar values. These values are collectively passed as a **kwargs dictionary in Python.
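The idea of scalar keyword arguments reaching a class's __init__ once, and the resulting state being reused for every row, can be sketched in plain Python. PrefixUdf, split_arguments, and the SPECIAL set are hypothetical: the set contains only the four example names mentioned above, and how with_arguments actually separates and forwards values is an assumption made for this illustration.

```python
from typing import Any, Dict, Tuple

# Only the example names given in the text; the real list may be longer.
SPECIAL = {"concurrency", "timeout", "dpu", "apu"}


class PrefixUdf:
    """Class-style UDF whose state is built once in __init__ and then reused."""

    def __init__(self, prefix: str = "id-", width: int = 4):
        self.prefix = prefix
        self.width = width

    def __call__(self, value: int) -> str:
        # Zero-pad the value to the configured width and add the prefix.
        return f"{self.prefix}{value:0{self.width}d}"


def split_arguments(**kwargs: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate runtime settings from values meant for __init__."""
    special = {k: v for k, v in kwargs.items() if k in SPECIAL}
    general = {k: v for k, v in kwargs.items() if k not in SPECIAL}
    return special, general


# All values are scalars and arrive together as one **kwargs dictionary.
special, general = split_arguments(timeout=30, concurrency=2, prefix="ord-", width=6)

udf = PrefixUdf(**general)              # one-time initialization
results = [udf(v) for v in (7, 42)]     # state reused for every call
```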