Configuring Distinct Aggregation Optimization
This section applies only to MRS 3.3.1-LTS or later.
Scenario
In SQL statements that combine multiple count(distinct) aggregate functions with operators that expand data, such as cube and rollup, enabling this feature can significantly reduce the data expansion factor. This reduces the amount of shuffle data written to disk and improves performance. Once the feature is enabled, count(distinct) is no longer implemented as "expand + multi-round aggregation" but as a "count_distinct" aggregate function.
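To get a feel for how quickly the data can expand in this scenario, the following Python sketch estimates the row multiplication factor. It rests on two assumptions that this section does not state: that cube over n columns produces 2^n grouping sets (rollup produces n + 1), and that the expand-based plan emits one copy of each row per distinct count(distinct) column. The function names are hypothetical, and the way MRS actually measures the expansion factor may differ.

```python
def grouping_set_count(num_columns: int, operator: str) -> int:
    """Number of grouping sets generated by CUBE or ROLLUP over num_columns columns."""
    if operator == "cube":
        return 2 ** num_columns
    if operator == "rollup":
        return num_columns + 1
    raise ValueError(f"unsupported operator: {operator}")


def estimated_expansion(num_columns: int, operator: str, num_distinct: int) -> int:
    """Rough row multiplication factor of the expand-based plan (an assumption,
    not the formula MRS uses): grouping sets times count(distinct) columns."""
    return grouping_set_count(num_columns, operator) * max(num_distinct, 1)


print(estimated_expansion(8, "cube", 4))    # 256 grouping sets * 4 = 1024
print(estimated_expansion(8, "rollup", 4))  # 9 grouping sets * 4 = 36
```

Under these assumptions, a cube over eight columns with four count(distinct) columns multiplies every input row about a thousand times, which is the kind of query this optimization targets.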
Constraints
Ensure that sufficient memory is configured for the job.
Configuring Parameters
| Parameter | Description | Default Value |
|---|---|---|
| spark.sql.keep.distinct.expandThreshold | Threshold for triggering this optimization when operators such as cube cause significant data expansion. When this parameter is set to a positive integer, such as 1024, the optimization is triggered once the data expands by a factor of 1024 or more. | -1 |
| spark.sql.distinct.aggregator.enabled | Whether to force distinct aggregation optimization. When this parameter is set to true, distinct aggregation is rewritten regardless of the data expansion factor. Enable it only after confirming that it is beneficial. | false |
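The way the two parameters interact can be sketched in Python as follows. This is an illustrative model of the descriptions in the table, not MRS code: the function names are hypothetical, and it assumes that a non-positive threshold (such as the default -1) leaves the threshold-based trigger off. The second helper renders the settings as `--conf` options for `spark-submit`.

```python
from typing import List


def optimization_applies(expansion_factor: int,
                         expand_threshold: int = -1,
                         force_enabled: bool = False) -> bool:
    """Model of the table: forcing ignores the expansion factor; otherwise a
    positive threshold must be reached (assumed behavior, not MRS source)."""
    if force_enabled:
        return True
    return expand_threshold > 0 and expansion_factor >= expand_threshold


def spark_submit_conf_args(expand_threshold: int = -1,
                           force_enabled: bool = False) -> List[str]:
    """Render both parameters as --conf options for spark-submit."""
    settings = {
        "spark.sql.keep.distinct.expandThreshold": str(expand_threshold),
        "spark.sql.distinct.aggregator.enabled": str(force_enabled).lower(),
    }
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(optimization_applies(1024, expand_threshold=1024))  # True
print(optimization_applies(36, expand_threshold=1024))    # False
print(optimization_applies(36, force_enabled=True))       # True
print(" ".join(spark_submit_conf_args(expand_threshold=1024)))
```

Setting only the threshold keeps the rewrite limited to queries with heavy expansion, whereas forcing it applies the rewrite to every qualifying distinct aggregation, so the memory constraint above matters more in that case.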