Merging CBO
Scenario
Spark SQL supports rule-based optimization by default. However, the rule-based optimization cannot ensure that Spark selects the optimal query plan. Cost-Based Optimizer (CBO) is a technology that intelligently selects query plans for SQL statements. After CBO is enabled, the CBO optimizer performs a series of estimations based on the table and column statistics to select the optimal query plan.
Procedure
Perform the following steps to enable CBO:
- You need to run corresponding SQL commands to collect required table and column statistics.
SQL commands are as follows (to be chosen as required):
- Generate table-level statistics (table scanning):
ANALYZE TABLE src COMPUTE STATISTICS
This command generates sizeInBytes and rowCount.
When you use the ANALYZE statement to collect statistics, sizes of tables not from HDFS cannot be calculated.
- Generate table-level statistics (no table scanning):
ANALYZE TABLE src COMPUTE STATISTICS NOSCAN
This command generates only sizeInBytes. Compared with the originally generated sizeInBytes and rowCount if the sizeInBytes remains unchanged, rowCount (if any) reserves. Otherwise, rowCount is cleared.
- Generate column-level statistics:
ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS a, b, c
This command generates column statistics and updates table statistics for consistency. Statistics of complicated data types (such as Seq and Map) and HiveStringType cannot be generated.
- Display statistics:
This command displays xxx bytes and xxx rows in Statistics to indicate table-level statistics. You can also run the following command to display column statistics:
DESC FORMATTED src a
Limitation: The current statistics collection does not support statistics for partition levels for partitioned tables.
- Generate table-level statistics (table scanning):
- Configure parameters in Table 1 in the spark-defaults.conf file on the Spark client.
Table 1 Parameter description Parameter
Description
Default Value
spark.sql.cbo.enabled
The switch to enable or disable CBO.
- true: Enable
- false: Disable
To enable this function, ensure that statistics of related tables and columns are generated.
false
spark.sql.cbo.joinReorder.enabled
Specifies whether to automatically adjust the sequence of consecutive inner joins by using CBO.
- true: Enable
- false: Disable
To enable this function, ensure that statistics of related tables and columns are generated and CBO is enabled.
false
spark.sql.cbo.joinReorder.dp.threshold
Specifies the threshold of the number of tables that the sequence of consecutive inner joins is automatically adjusted by CBO.
If the threshold is exceeded, the sequence of joins is not adjusted.
12
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot