Configuring Distinct Aggregation Optimization

This section applies only to MRS 3.3.1-LTS or later.

Scenario

For SQL statements that contain multiple count(distinct) aggregate functions together with operators that expand data, such as cube and rollup, enabling this feature significantly reduces the data expansion factor and the amount of data shuffled to disk, improving performance. When the feature is enabled, count(distinct) is no longer implemented as an expand operation followed by multiple rounds of aggregation; it is evaluated directly with the count_distinct aggregate function.
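For example, a statement of the following form, which combines several count(distinct) aggregates with cube, is the kind of query this optimization targets (the table and column names are illustrative assumptions, not part of the product documentation):

SELECT region,
       product,
       count(DISTINCT user_id)  AS distinct_users,
       count(DISTINCT order_id) AS distinct_orders
FROM   sales
GROUP  BY region, product WITH CUBE;

With the feature enabled, Spark evaluates such a statement with count_distinct aggregate functions instead of expanding the data and aggregating it in multiple rounds, which reduces the shuffle volume.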

Constraints

Ensure that sufficient memory is configured for the job.

Configuring Parameters

Modify the following parameters in the Client installation directory/Spark/spark/conf/spark-defaults.conf file on the Spark client.

Parameter: spark.sql.keep.distinct.expandThreshold
Description: Threshold for triggering the optimization when cube operations cause significant data expansion. Set it to a positive integer, for example 1024, to trigger the optimization when the data volume expands by a factor of 1024 or more.
Default value: -1

Parameter: spark.sql.distinct.aggregator.enabled
Description: Whether to force the distinct aggregation optimization. When this parameter is set to true, the distinct aggregation is rewritten regardless of the data expansion factor. Enable it only when you are sure the rewrite is beneficial.
Default value: false
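For example, to trigger the optimization only when the data volume expands by a factor of 1024 or more, the parameters could be set in spark-defaults.conf as follows (the threshold value is illustrative; tune it for your workload):

spark.sql.keep.distinct.expandThreshold 1024
spark.sql.distinct.aggregator.enabled false

Setting spark.sql.distinct.aggregator.enabled to true would instead force the rewrite for every eligible statement regardless of the expansion factor; do so only when you are sure it is beneficial.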