Updated on 2025-08-22 GMT+08:00

Spark Distinct Aggregation Optimization

Scenarios

When a SQL statement combines multiple count(distinct) aggregation functions with data-expanding operators such as cube or rollup, enabling this optimization can significantly reduce data duplication. By cutting the volume of data shuffled to disk, it greatly improves query performance. Once enabled, Spark replaces the resource-intensive "expand + multi-round aggregation" execution strategy for count(distinct) with a leaner count_distinct aggregation function, markedly improving efficiency.
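As an illustration, a query of the following shape combines multiple count(distinct) functions with cube, which multiplies the rows shuffled to disk and is the kind of workload this optimization targets (the table and column names here are hypothetical):

```sql
-- Each grouping combination produced by CUBE normally forces Spark to
-- "expand" every input row once per distinct column, multiplying the
-- volume of data shuffled to disk.
SELECT
  region,
  product,
  count(DISTINCT user_id)  AS unique_users,   -- distinct aggregation 1
  count(DISTINCT order_id) AS unique_orders   -- distinct aggregation 2
FROM sales
GROUP BY region, product WITH CUBE;
```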

Notes and Constraints

  • This section applies only to MRS 3.3.1-LTS or later.
  • Ensure that sufficient memory is configured for the job.

Configuring Parameters

  1. Install the Spark client.

    For details, see Installing a Client.

  2. Log in to the Spark client node as the client installation user.

    Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

    Parameter: spark.sql.keep.distinct.expandThreshold

    Description: Threshold (data expansion factor) for activating this optimization in scenarios where operators such as cube expand the data. Set this parameter to a value greater than 0 to enable the optimization. For example, if it is set to 1024, the optimization is triggered when the data expands by a factor of 1024 or more. If it is set to -1, the optimization is disabled.

    Example Value: 100

    Parameter: spark.sql.distinct.aggregator.enabled

    Description: Whether to forcibly enable the distinct aggregation optimization.

      • false: The optimization is not forcibly enabled.
      • true: The optimization is forcibly enabled regardless of the data expansion factor, and Spark restructures the distinct aggregation logic. Use this setting only when you are sure it is beneficial.

    Example Value: false
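Putting the two parameters together, a minimal spark-defaults.conf fragment might look like the following (the values shown are the example values from the table above, not tuning recommendations):

```
# Trigger the optimization when data expands by a factor of 100 or more;
# set to -1 to disable the optimization entirely.
spark.sql.keep.distinct.expandThreshold=100

# Do not forcibly enable the optimization; rely on the threshold above.
spark.sql.distinct.aggregator.enabled=false
```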