Configuring the Column Statistics Histogram for Higher CBO Accuracy

Scenario

Typically, Spark SQL statements are optimized using heuristic optimization rules. Such rules are provided only based on the characteristics of the logical plan and the characteristics of the data (the execution cost of the operator) are not considered. Spark 2.2 introduces cost-based optimizer (CBO). CBO collects table and column statistics and estimates the number of output records and byte size of each operator based on the input data set of the operator. These are the cost of executing an operator.

CBO adjusts the execution plan to minimize the end-to-end query time. The idea is as follows:

Filter out irrelevant data as early as possible.
Minimize the cost of each operator.

The CBO optimization process is divided into two steps:

Collect statistics.
Estimate the output data set of a specific operator based on the input data set.

Table-level statistics include the number of records and the total size of table data files.

Column-level statistics include the number of unique values, maximum value, minimum value, number of null values, average length, maximum length, and histogram.

After the statistics are obtained, the execution cost of the operator can be estimated. Common operators include the Filter and Join operators.

Histogram is a type of column statistics. It can clearly describe the distribution of column data. The column data is distributed to a specified number of bins that are displayed in ascending order by size. The upper and lower limits of each bin are calculated. The amount of data in all bins is the same (a contour histogram). With the detailed distribution of data, the cost estimation of each operator is more accurate and the optimization effect is better.

This feature can be enabled using the following parameter:

spark.sql.statistics.histogram.enabled: specifies whether to enable the histogram function. The default value is false.

Parameters

Log in to FusionInsight Manager and choose Cluster > Services > Spark. Click Configurations then All Configurations, and search for the following parameters:

Parameter	Description	Default Value	Value Range
spark.sql.cbo.enabled	Whether to enable CBO to estimate the statistics of the execution plan	false	[true,false]
spark.sql.cbo.joinReorder.enabled	Whether to enable CBO connection reordering	false	[true,false]
spark.sql.cbo.joinReorder.dp.threshold	Maximum number of join nodes allowed in the dynamic planning algorithm	12	>=1
spark.sql.cbo.joinReorder.card.weight	Proportion of the dimension (number of rows) in the cost comparison of the reconnection execution plan: Number of rows x Proportion + File size x (1 – Proportion)	0.7	0-1
spark.sql.statistics.size.autoUpdate.enabled	Whether to enable the function of automatically updating the table size when the table data changes. If there are a large number of data files in a table, this operation consumes a lot of resources and slows down data operations.	false	[true,false]
spark.sql.statistics.histogram.enabled	After this function is enabled, a histogram is generated when column information is collected. Histograms can improve estimation accuracy, but collecting histogram information requires additional workload.	false	[true,false]
spark.sql.statistics.histogram.numBins	Number of slots in the generated histogram	254	>=2
spark.sql.statistics.ndv.maxError	Maximum estimation error allowed by the HyperLogLog++ algorithm when column-level statistics are generated	0.05	0-1
spark.sql.statistics.percentile.accuracy	Accuracy of percentile estimation when generating equal height histograms. A larger value indicates more accuracy. The estimated error value can be obtained using 1.0/Percentile estimation accuracy.	10000	>=1

A histogram takes effect in CBO only when the following conditions are met:
- spark.sql.statistics.histogram.enabled: The default value is false. Change the value to true to enable the histogram function.
- spark.sql.cbo.enabled: The default value is false. Change the value to true to enable CBO.
- spark.sql.cbo.joinReorder.enabled: The default value is false. Change the value to true to enable connection reordering.
If a client is used to submit tasks, the modification of spark.sql.cbo.enabled, spark.sql.cbo.joinReorder.enabled, spark.sql.cbo.joinReorder.dp.threshold, spark.sql.cbo.joinReorder.card.weight, spark.sql.statistics.size.autoUpdate.enabled, spark.sql.statistics.histogram.enabled, spark.sql.statistics.histogram.numBins, spark.sql.statistics.ndv.maxError, and spark.sql.statistics.percentile.accuracy takes effect only after the client is downloaded again.