案例:调整查询重写GUC参数rewrite_rule
rewrite_rule包含了多个查询重写规则:magicset、uniquecheck、intargetlist、predpush等。下面简要说明其中重要的几个规则使用场景。
案例环境准备
为了便于规则的使用场景演示,需准备建表语句如下:
--清理环境 DROP SCHEMA IF EXISTS rewrite_rule_guc_test CASCADE; CREATE SCHEMA rewrite_rule_guc_test; SET current_schema=rewrite_rule_guc_test; --创建测试表 CREATE TABLE t(c1 INT, c2 INT, c3 INT, c4 INT); CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT); CREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT);
目标列子查询提升参数intargetlist
通过将目标列中子查询提升,转为join,往往可以极大提升查询性能。举例如下查询:
gaussdb=# SET rewrite_rule='none'; SET gaussdb=# EXPLAIN (verbose on, costs off) SELECT c1,(SELECT avg(c2) FROM t2 WHERE t2.c2=t1.c2) FROM t1 WHERE t1.c1<100 ORDER BY t1.c2; QUERY PLAN ----------------------------------------------- Sort Output: t1.c1, ((SubPlan 1)), t1.c2 Sort Key: t1.c2 -> Seq Scan on public.t1 Output: t1.c1, (SubPlan 1), t1.c2 Filter: (t1.c1 < 100) SubPlan 1 -> Aggregate Output: avg(t2.c2) -> Seq Scan on public.t2 Output: t2.c1, t2.c2 Filter: (t2.c2 = t1.c2) (12 rows)
由于目标列中的相关子查询(select avg(c2) from t2 where t2.c2=t1.c2)无法提升的缘故,导致每扫描t1的一行数据,就会触发子查询的一次执行,效率低下。如果打开intargetlist参数会把子查询提升转为join,从而提升查询的性能。
gaussdb=# SET rewrite_rule='intargetlist'; SET gaussdb=# EXPLAIN (verbose on, costs off) SELECT c1,(SELECT avg(c2) FROM t2 WHERE t2.c2=t1.c2) FROM t1 WHERE t1.c1<100 ORDER BY t1.c2; QUERY PLAN ----------------------------------------------- Sort Output: t1.c1, (avg(t2.c2)), t1.c2 Sort Key: t1.c2 -> Hash Left Join Output: t1.c1, (avg(t2.c2)), t1.c2 Hash Cond: (t1.c2 = t2.c2) -> Seq Scan on public.t1 Output: t1.c1, t1.c2 Filter: (t1.c1 < 100) -> Hash Output: (avg(t2.c2)), t2.c2 -> HashAggregate Output: avg(t2.c2), t2.c2 Group By Key: t2.c2 -> Seq Scan on public.t2 Output: t2.c2 (16 rows)
提升无agg的子查询uniquecheck
子链接提升需要保证对于每个条件只有一行输出,对于有agg的子查询可以自动提升,对于无agg的子查询如:
SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2);
重写为:
SELECT t1.c1 FROM t1 JOIN (SELECT t2.c1 FROM t2 WHERE t2.c1 IS NOT NULL GROUP BY t2.c1(unique check)) tt(c1) on tt.c1=t1.c1;
需注意,上述SQL中的unique check表示t2.c1需要进行检查,非正常SQL表达,该SQL无法直接执行。为了保证语义等价,子查询tt必须保证对于每个group by t2.c1只能有一行输出。打开uniquecheck查询重写参数保证可以提升并且等价,如果在运行时输出了多于一行的数据,就会报错。
gaussdb=# SET rewrite_rule='uniquecheck'; SET gaussdb=# EXPLAIN verbose SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c1); QUERY PLAN ------------------------------------------------------------------------------------- Hash Join (cost=43.36..104.40 rows=2149 distinct=[200, 200] width=4) Output: t1.c1 Hash Cond: (t1.c1 = subquery."?column?") -> Seq Scan on public.t1 (cost=0.00..31.49 rows=2149 width=4) Output: t1.c1, t1.c2 -> Hash (cost=40.86..40.86 rows=200 width=8) Output: subquery."?column?", subquery.c1 -> Subquery Scan on subquery (cost=36.86..40.86 rows=200 width=8) Output: subquery."?column?", subquery.c1 -> HashAggregate (cost=36.86..38.86 rows=200 width=4) Output: t2.c1, t2.c1 Group By Key: t2.c1 Filter: (t2.c1 IS NOT NULL) Unique Check Required -> Seq Scan on public.t2 (cost=0.00..31.49 rows=2149 width=4) Output: t2.c1 (16 rows)
注意:因为分组group by t2.c1 unique check发生在过滤条件tt.c1=t1.c1之前,可能导致原来不报错的查询重写之后报错。举例:
有t1,t2表,其中的数据为:
gaussdb=# SELECT * FROM t1 ORDER BY c2; c1 | c2 ----+---- 1 | 1 2 | 2 3 | 3 (3 rows) gaussdb=# SELECT * FROM t2 ORDER BY c2; c1 | c2 ----+---- 1 | 1 2 | 2 3 | 3 4 | 4 4 | 4 5 | 5 (6 rows)
分别关闭和打开uniquecheck参数对比,打开之后报错。
gaussdb=# SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2) ; c1 ---- 1 2 3 (3 rows) gaussdb=# SET rewrite_rule='uniquecheck'; SET gaussdb=# SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2) ; ERROR: more than one row returned by a subquery used as an expression
消除子查询中的聚集运算参数lazyagg
消除子查询中的聚集运算,以此提高查询效率。举例:
gaussdb=# SET rewrite_rule =none; SET gaussdb=# EXPLAIN (costs off) SELECT t.c2, sum(cc) FROM (SELECT c2, sum(c3) AS cc FROM t1 GROUP BY c2) s1, t WHERE s1.c2=t.c2 GROUP BY t.c2 ORDER BY 1,2; QUERY PLAN ----------------------------------------------------- Sort Sort Key: t.c2, (sum(s1.cc)) -> HashAggregate Group By Key: t.c2 -> Hash Join Hash Cond: (t.c2 = s1.c2) -> Seq Scan on t -> Hash -> Subquery Scan on s1 -> HashAggregate Group By Key: t1.c2 -> Seq Scan on t1 (12 rows)
子查询与外层查询存在同样的group by条件,两层聚集运算可能导致查询效率低下,打开lazyagg参数,消除子查询中的聚集运算,提升查询性能:
gaussdb=# SET rewrite_rule = lazyagg; SET gaussdb=# EXPLAIN (costs off) SELECT t.c2, sum(cc) FROM (SELECT c2, sum(c3) AS cc FROM t1 GROUP BY c2) s1, t WHERE s1.c2=t.c2 GROUP BY t.c2 ORDER BY 1,2; QUERY PLAN ------------------------------------------ Sort Sort Key: t.c2, (sum((t1.c3)::bigint)) -> HashAggregate Group By Key: t.c2 -> Hash Join Hash Cond: (t1.c2 = t.c2) -> Seq Scan on t1 -> Hash -> Seq Scan on t (9 rows)
从主查询下推条件到子查询参数magicset
将带有聚集算子的子查询提前和主查询进行关联。举例:
gaussdb=# SET rewrite_rule = none; SET gaussdb=# EXPLAIN (costs off) SELECT t1 FROM t1 WHERE t1.c2 = 10 AND t1.c3 < (SELECT sum(c3) FROM t2 WHERE t1.c1 = t2.c1); QUERY PLAN --------------------------------------- Hash Join Hash Cond: (t2.c1 = t1.c1) Join Filter: (t1.c3 < (sum(t2.c3))) -> HashAggregate Group By Key: t2.c1 -> Seq Scan on t2 -> Hash -> Seq Scan on t1 Filter: (c2 = 10) (9 rows)
先针对子查询的关联字段进行分组聚集,再和主查询进行关联,减少相关子链接的重复扫描,提升查询效率,修改重写参数后,计划改变:
gaussdb=# SET rewrite_rule = magicset; SET gaussdb=# explain (costs off) SELECT t1 FROM t1 WHERE t1.c2 = 10 AND t1.c3 < (SELECT sum(c3) FROM t2 WHERE t1.c1 = t2.c1); QUERY PLAN --------------------------------------------------------------------- Hash Join Hash Cond: (t2.c1 = rewrite_rule_guc_test.t1.c1) Join Filter: (rewrite_rule_guc_test.t1.c3 < (sum(t2.c3))) -> HashAggregate Group By Key: t2.c1 -> Hash Join Hash Cond: (t2.c1 = rewrite_rule_guc_test.t1.c1) -> Seq Scan on t2 -> Hash -> HashAggregate Group By Key: rewrite_rule_guc_test.t1.c1 -> Seq Scan on t1 Filter: (c2 = 10) -> Hash -> Seq Scan on t1 Filter: (c2 = 10) (16 rows)