Hive CBO Principles
Hive CBO Principles
CBO is short for Cost-Based Optimization.
It will optimize the following:
During compilation, the CBO calculates the most efficient join sequence based on tables and query conditions involved in query statements to reduce time and resources required for query.
In Hive, the CBO is implemented as follows:
Hive uses open-source component Apache Calcite to implement the CBO. SQL statements are first converted into Hive Abstract Syntax Trees (ASTs) and then into RelNodes that can be identified by Calcite. After Calcite adjusts the join sequence in RelNodes, RelNodes are converted into ASTs by Hive to continue the logical and physical optimization. Figure 1 shows the working flow.
Calcite adjusts the join sequence as follows:
- A table is selected as the first table from the tables to be joined.
- The second and third tables are selected based on the cost. In this way, multiple different execution plans are obtained.
- A plan with the minimum costs is calculated and serves as the final sequence.
The cost calculation method is as follows:
In the current version, costs are measured based on the number of data entries after joining. Fewer data entries mean less cost. The number of joined data entries depends on the selection rate of joined tables. The number of data entries in a table is obtained based on the table-level statistics.
The number of data entries in a table after filtering is estimated based on the column-level statistics, including the maximum values (max), minimum values (min), and Number of Distinct Values (NDV).
For example, there is a table table_a whose total number of data records is 1,000,000 and NDV is 50. The query conditions are as follows:
Select * from table_a where colum_a='value1';
The estimated number of queried data entries is: 1,000,000 x 1/50 = 20,000. The selection rate is 2%.
The following takes the TPC-DS Q3 as an example to describe how the CBO adjusts the join sequence:
select dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg from date_dim dt, store_sales, item where dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy = 12 group by dt.d_year , item.i_brand , item.i_brand_id order by dt.d_year , sum_agg desc , brand_id limit 10;
Statement explanation: This statement indicates that inner join is performed for three tables: table store_sales is a fact table with about 2,900,000,000 data entries, table date_dim is a dimension table with about 73,000 data entries, and table item is a dimension table with about 18,000 data entries. Each table has filtering conditions. Figure 2 shows the join relationship.
The CBO must first select the tables that bring better filtering effect for joining.
By analyzing min, max, NDV, and the number of data entries, the CBO estimates the selection rates of different dimension tables, as shown in Table 1.
Table |
Number of Original Data Entries |
Number of Data Entries After Filtering |
Selection Rate |
---|---|---|---|
date_dim |
73,000 |
6,200 |
8.5% |
item |
18,000 |
19 |
0.1% |
The selection rate can be estimated as follows: Selection rate = Number of data entries after filtering/Number of original data entries
As shown in the preceding table, the item table has a better filtering effect. Therefore, the CBO joins the item table first before joining the date_dim table.
Figure 3 shows the join process when the CBO is disabled.
Figure 4 shows the join process when the CBO is enabled.
After the CBO is enabled, the number of intermediate data entries is reduced from 495,000,000 to 2,900,000 and thus the execution time can be remarkably reduced.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.