Spark Distinct Aggregation Optimization
Scenarios
When SQL statements include multiple count(distinct) aggregation functions combined with data-expanding operators like cube and rollup, enabling this optimization can significantly reduce data duplication. By minimizing the volume of data shuffled to disk, query performance is greatly enhanced. Once enabled, the count(distinct) operator shifts from a resource-intensive "expand + multi-round aggregation" strategy to a streamlined count_distinct aggregation function, significantly improving efficiency.
Notes and Constraints
- This section applies only to MRS 3.3.1-LTS or later.
- Sufficient memory has been configured for the job.
Configuring Parameters
- Install the Spark client.
For details, see Installing a Client.
- Log in to the Spark client node as the client installation user.
Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.
Parameter
Description
Example Value
spark.sql.keep.distinct.expandThreshold
Threshold (data multiplication factor) for activating this optimization in scenarios that involve data expansion due to operators like cube You need to set this parameter to a value greater than 0 to enable the optimization. For example, if this parameter is set to 1024, this optimization is triggered if the data expands by a factor of 1024 or more.
If this parameter is set to -1, the optimization is disabled.
100
spark.sql.distinct.aggregator.enabled
Whether to forcibly enable distinct aggregation optimization.
- false: Forcible enforcement of distinct aggregation optimization is disabled.
- true: Forcibly enables distinct aggregation optimization, regardless of the data expansion factor. Spark restructures the distinct aggregation logic. Use this setting only when you ensure it is beneficial.
false
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot