Updated on 2022-11-18 GMT+08:00

Merging CBO

Scenario

Spark SQL supports rule-based optimization by default. However, the rule-based optimization cannot ensure that Spark selects the optimal query plan. Cost-Based Optimizer (CBO) is a technology that intelligently selects query plans for SQL statements. After CBO is enabled, the CBO optimizer performs a series of estimations based on the table and column statistics to select the optimal query plan.

Procedure

Perform the following steps to enable CBO:

  1. You need to run corresponding SQL commands to collect required table and column statistics.

    SQL commands are as follows (to be chosen as required):

    • Generate table-level statistics (table scanning):

      ANALYZE TABLE src COMPUTE STATISTICS

      This command generates sizeInBytes and rowCount.

      When you use the ANALYZE statement to collect statistics, sizes of tables not from HDFS cannot be calculated.

    • Generate table-level statistics (no table scanning):

      ANALYZE TABLE src COMPUTE STATISTICS NOSCAN

      This command generates only sizeInBytes. Compared with the originally generated sizeInBytes and rowCount if the sizeInBytes remains unchanged, rowCount (if any) reserves. Otherwise, rowCount is cleared.

    • Generate column-level statistics:

      ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS a, b, c

      This command generates column statistics and updates table statistics for consistency. Statistics of complicated data types (such as Seq and Map) and HiveStringType cannot be generated.

    • Display statistics:

      DESC FORMATTED src

      This command displays xxx bytes and xxx rows in Statistics to indicate table-level statistics. You can also run the following command to display column statistics:

      DESC FORMATTED src a

    Limitation: The current statistics collection does not support statistics for partition levels for partitioned tables.

  1. Configure parameters in Table 1 in the spark-defaults.conf file on the Spark client.
    Table 1 Parameter description

    Parameter

    Description

    Default Value

    spark.sql.cbo.enabled

    The switch to enable or disable CBO.

    • true: Enable
    • false: Disable

    To enable this function, ensure that statistics of related tables and columns are generated.

    false

    spark.sql.cbo.joinReorder.enabled

    Specifies whether to automatically adjust the sequence of consecutive inner joins by using CBO.

    • true: Enable
    • false: Disable

    To enable this function, ensure that statistics of related tables and columns are generated and CBO is enabled.

    false

    spark.sql.cbo.joinReorder.dp.threshold

    Specifies the threshold of the number of tables that the sequence of consecutive inner joins is automatically adjusted by CBO.

    If the threshold is exceeded, the sequence of joins is not adjusted.

    12