Adaptive Parallelism
Scenario
In AI data engineering scenarios that involve massive data processing, actors must be called in parallel to improve processing efficiency and distribute the computation. A fixed parallelism level, however, depends on tuning experience and repeated trial runs. Adaptive parallelism is introduced to reduce this manual tuning effort and improve the usability of UDFs.
Constraints
Constraints on this feature:
- Valid values for min_concurrency and max_concurrency must be specified when executing a UDF.
- If the row count of the current batch is less than the computed parallelism level, the number of activated actors equals the row count; otherwise, it equals the computed parallelism level.
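The second constraint amounts to taking the minimum of the batch's row count and the computed parallelism level. A minimal sketch of that rule (the function name `actual_actor_count` is illustrative, not part of the product API):

```python
def actual_actor_count(batch_row_count: int, computed_parallelism: int) -> int:
    """Each row is processed by at most one actor, so the number of
    activated actors can never exceed the batch's row count."""
    return min(batch_row_count, computed_parallelism)
```

For example, a 3-row batch with a computed parallelism of 8 activates only 3 actors.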
Setting both the min_concurrency and max_concurrency parameters when executing a UDF enables the adaptive parallelism feature.
The UDF starts processing data batches at the min_concurrency parallelism level. For each batch, the system calculates the percentage of effective UDF computation time relative to the total execution time. If this percentage reaches 80% for a batch, the parallelism level increases by a step of 2, up to max_concurrency, after which it no longer changes. If a batch does not reach 80%, the parallelism level for the next batch remains unchanged.
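The step-up rule above can be sketched as follows. The 80% threshold and the step size of 2 come from the text; the function and variable names are illustrative assumptions, not the product's actual API:

```python
def next_parallelism(current: int, compute_time: float, total_time: float,
                     max_concurrency: int) -> int:
    """Return the parallelism level to use for the next batch.

    If effective UDF computation time accounts for at least 80% of the
    total batch time, raise the parallelism by a step of 2, capped at
    max_concurrency; otherwise keep the current level.
    """
    if compute_time / total_time >= 0.8:
        return min(current + 2, max_concurrency)
    return current

# Example: starting from min_concurrency=2 with max_concurrency=8,
# three consecutive batches at a 90% compute ratio ramp up 2 -> 4 -> 6 -> 8.
level = 2
for _ in range(3):
    level = next_parallelism(level, compute_time=9.0, total_time=10.0,
                             max_concurrency=8)
```

Note that the level only ever moves upward; a batch below the 80% threshold holds the current level rather than reducing it.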
With adaptive parallelism enabled for UDFs, statistics can vary significantly across batches because actors process the batches sequentially at different parallelism levels. For details about how to view statistics in adaptive parallelism scenarios, see Python UDF Performance Tuning.