案例:调整GUC参数best_agg_plan
现象描述
t1的表定义为:
1
|
create table t1(a int, b int, c int) store as orc; |
假设agg下层算子所输出结果集的分布列为setA,agg操作的group by列为setB,则在Stream框架下,Agg操作可以分为两个场景。
场景一:setA是setB的一个子集。
对于这种场景,直接对下层结果集进行汇聚的结果就是正确的汇聚结果,上层算子直接使用即可。
场景二:setA不是setB的一个子集。
对于这种场景,Stream执行框架分为如下三种计划形态:
hashagg+gather(redistribute)+hashagg
redistribute+hashagg(+gather)
hashagg+redistribute+hashagg(+gather)
DataArtsFabric SQL提供了guc参数best_agg_plan来干预执行计划,强制其生成上述对应的执行计划,此参数取值范围为0,1,2,3
- 取值为1时,强制生成第一种计划。
- 取值为2时,如果group by列可以重分布,强制生成第二种计划,否则生成第一种计划。
- 取值为3时,如果group by列可以重分布,强制生成第三种计划,否则生成第一种计划。
- 取值为0时,优化器会根据以上三种计划的估算代价选择最优的一种计划生成。
具体影响如下:
postgres=> set best_agg_plan=1; SET postgres=> explain select a, count(1) from t1 group by a; QUERY PLAN ------------------------------------------------------------------------------------------ id | operation | E-rows | E-memory | E-width | E-costs ----+-------------------------------------------+--------+----------+---------+--------- 1 | -> Row Adapter | 200 | | 12 | 85.16 2 | -> Vector Sonic Hash Aggregate | 200 | | 12 | 85.16 3 | -> Vector Streaming (type: GATHER) | 522 | | 12 | 85.16 4 | -> Vector Sonic Hash Aggregate | 522 | 16MB | 12 | 63.41 5 | -> Vector Foreign Scan on t1 | 1000 | 1MB | 4 | 60.00 Predicate Information (identified by plan id) --------------------------------------------- 5 --Vector Foreign Scan on t1 Server Type: lf Total files left: 0 ====== Query Summary ===== ------------------------------ System available mem: 262144KB Query Max mem: 262144KB Query estimated mem: 1872KB (19 rows) postgres=> set best_agg_plan=2; SET postgres=> explain select a, count(1) from t1 group by a; QUERY PLAN -------------------------------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+---------------------------------------------------+--------+----------+---------+--------- 1 | -> Row Adapter | 201 | | 12 | 78.84 2 | -> Vector Streaming (type: GATHER) | 201 | | 12 | 78.84 3 | -> Vector Sonic Hash Aggregate | 201 | 16MB | 12 | 70.47 4 | -> Vector Streaming(type: REDISTRIBUTE) | 1000 | 2MB | 4 | 67.39 5 | -> Vector Foreign Scan on t1 | 1000 | 1MB | 4 | 60.00 Predicate Information (identified by plan id) --------------------------------------------- 5 --Vector Foreign Scan on t1 Server Type: lf Total files left: 0 ====== Query Summary ===== ------------------------------ System available mem: 262144KB Query Max mem: 262144KB Query estimated mem: 3965KB (19 rows) postgres=> set best_agg_plan=3; SET postgres=> explain select a, count(1) from t1 group by a; QUERY PLAN -------------------------------------------------------------------------------------------------- id | operation | E-rows | E-memory | E-width | E-costs ----+---------------------------------------------------+--------+----------+---------+--------- 1 | -> Row Adapter | 200 | | 12 | 82.25 2 | -> Vector Streaming (type: GATHER) | 200 | | 12 | 82.25 3 | -> Vector Sonic Hash Aggregate | 200 | 16MB | 12 | 70.25 4 | -> Vector Streaming(type: REDISTRIBUTE) | 522 | 2MB | 12 | 68.03 5 | -> Vector Sonic Hash Aggregate | 522 | 16MB | 12 | 63.41 6 | -> Vector Foreign Scan on t1 | 1000 | 1MB | 4 | 60.00 Predicate Information (identified by plan id) --------------------------------------------- 6 --Vector Foreign Scan on t1 Server Type: lf Total files left: 0 ====== Query Summary ===== ------------------------------ System available mem: 262144KB Query Max mem: 262144KB Query estimated mem: 3968KB (20 rows)
总结
通常优化器总会选择最优的执行计划,但是众所周知代价估算,尤其是中间结果集的代价估算一般会有比较大的偏差,这种比较大的偏差就可能会导致agg的计算方式出现比较大的偏差,这时候就需要通过best_agg_plan进行agg计算模型的干预。
一般来说,当agg汇聚的收敛度很小时,即结果集的个数在agg之后并没有明显变少时(经验上以5倍为临界点),选择redistribute+hashagg执行方式,否则选择hashagg+redistribute+hashagg执行方式。