Updated on 2025-08-22 GMT+08:00

Spark Distinct Aggregation Optimization

Scenarios

When a SQL statement combines multiple count(distinct) aggregation functions with data-expanding operators such as cube or rollup, enabling this optimization can significantly reduce data duplication. By cutting the volume of data shuffled to disk, it greatly improves query performance. Once enabled, Spark replaces the resource-intensive "expand + multi-round aggregation" execution strategy for count(distinct) with a leaner count_distinct aggregation function, markedly improving efficiency.
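As an illustration, a query of the following shape combines multiple count(distinct) functions with cube, which multiplies the rows shuffled to disk and is the kind of workload this optimization targets (the table and column names here are hypothetical):

```sql
-- Each grouping combination produced by CUBE normally forces Spark to
-- "expand" every input row once per distinct column, multiplying the
-- volume of data shuffled to disk.
SELECT
  region,
  product,
  count(DISTINCT user_id)  AS unique_users,   -- distinct aggregation 1
  count(DISTINCT order_id) AS unique_orders   -- distinct aggregation 2
FROM sales
GROUP BY region, product WITH CUBE;
```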

Notes and Constraints

  • This section applies only to MRS 3.3.1-LTS or later.
  • Ensure that sufficient memory is configured for the job.

Configuring Parameters

  1. Install the Spark client.

    For details, see Installing a Client.

  2. Log in to the Spark client node as the client installation user.

    Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client.

    Parameter: spark.sql.keep.distinct.expandThreshold

    Description: Threshold (data expansion factor) for activating this optimization in scenarios where operators such as cube expand the data. Set this parameter to a value greater than 0 to enable the optimization. For example, if it is set to 1024, the optimization is triggered when the data expands by a factor of 1024 or more. If it is set to -1, the optimization is disabled.

    Example Value: 100

    Parameter: spark.sql.distinct.aggregator.enabled

    Description: Whether to forcibly enable the distinct aggregation optimization.

      • false: The optimization is not forcibly enabled.
      • true: The optimization is forcibly enabled regardless of the data expansion factor, and Spark restructures the distinct aggregation logic. Use this setting only when you are sure it is beneficial.

    Example Value: false
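Putting the two parameters together, a minimal spark-defaults.conf fragment might look like the following (the values shown are the example values from the table above, not tuning recommendations):

```
# Trigger the optimization when data expands by a factor of 100 or more;
# set to -1 to disable the optimization entirely.
spark.sql.keep.distinct.expandThreshold=100

# Do not forcibly enable the optimization; rely on the threshold above.
spark.sql.distinct.aggregator.enabled=false
```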