Help Center/ GaussDB/ Centralized_8.x/ SQL Optimization/ Optimization Cases/ Case: Modifying the GUC Parameter rewrite_rule
Updated on 2024-06-03 GMT+08:00

Case: Modifying the GUC Parameter rewrite_rule

rewrite_rule contains multiple query rewriting rules: magicset, uniquecheck, intargetlist, and predpush. The following describes the application scenarios of some important rules.

Preparing the Case Environment

To demonstrate rule application scenarios, you need to prepare the following table creation statements:

-- Clean the environment.
DROP SCHEMA IF EXISTS rewrite_rule_guc_test CASCADE;
CREATE SCHEMA rewrite_rule_guc_test;
SET current_schema=rewrite_rule_guc_test;
-- Create a test table.
CREATE TABLE t(c1 INT, c2 INT, c3 INT, c4 INT);
CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT);
CREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT);

intargetlist: Target Column Subquery Performance Improvement

The query performance can be greatly improved by converting the subquery in the target column to JOIN. The following is an example:

gaussdb=#  SET rewrite_rule='none';
SET
gaussdb=#  EXPLAIN (verbose on, costs off) SELECT c1,(SELECT avg(c2) FROM t2 WHERE t2.c2=t1.c2) FROM t1 WHERE t1.c1<100 ORDER BY t1.c2;
                  QUERY PLAN
-----------------------------------------------
 Sort
   Output: t1.c1, ((SubPlan 1)), t1.c2
   Sort Key: t1.c2
   ->  Seq Scan on public.t1
         Output: t1.c1, (SubPlan 1), t1.c2
         Filter: (t1.c1 < 100)
         SubPlan 1
           ->  Aggregate
                 Output: avg(t2.c2)
                 ->  Seq Scan on public.t2
                       Output: t2.c1, t2.c2
                       Filter: (t2.c2 = t1.c2)
(12 rows)

Because the subquery (select avg(c2) from t2 where t2.c2=t1.c2) in the target column cannot be pulled up, execution of the subquery is triggered each time a row of data of t1 is scanned, and the query efficiency is low. If the intargetlist parameter is enabled, the subquery is converted to JOIN to improve the query performance.

gaussdb=#  SET rewrite_rule='intargetlist';
SET
gaussdb=# EXPLAIN (verbose on, costs off) SELECT c1,(SELECT avg(c2) FROM t2 WHERE t2.c2=t1.c2) FROM t1 WHERE t1.c1<100 ORDER BY t1.c2;
                  QUERY PLAN
-----------------------------------------------
 Sort
   Output: t1.c1, (avg(t2.c2)), t1.c2
   Sort Key: t1.c2
   ->  Hash Left Join
         Output: t1.c1, (avg(t2.c2)), t1.c2
         Hash Cond: (t1.c2 = t2.c2)
         ->  Seq Scan on public.t1
               Output: t1.c1, t1.c2
               Filter: (t1.c1 < 100)
         ->  Hash
               Output: (avg(t2.c2)), t2.c2
               ->  HashAggregate
                     Output: avg(t2.c2), t2.c2
                     Group By Key: t2.c2
                     ->  Seq Scan on public.t2
                           Output: t2.c2
(16 rows)

uniquecheck: Performance Improvement of Subqueries Without Aggregate Functions

To ensure subquery pullup, each condition must have only one line of output. The subqueries with aggregate functions can be automatically pulled up. For subqueries without aggregate functions, as shown in the following example:

SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2);

Rewrite as follows:

SELECT t1.c1 FROM t1 JOIN (SELECT t2.c1 FROM t2 WHERE t2.c1 IS NOT NULL GROUP BY t2.c1(unique check)) tt(c1) on tt.c1=t1.c1;

Note that unique check in the preceding SQL statement indicates that t2.c1 needs to be checked. If the SQL statement is abnormal, the SQL statement cannot be directly executed. To ensure semantic equivalence, the subquery tt must ensure that each group by t2.c1 has only one line of output. Enable the uniquecheck query rewriting parameter to ensure that the query can be pulled up and equivalent. If more than one row of data is output at run time, an error is reported.

gaussdb=# SET rewrite_rule='uniquecheck';
SET
gaussdb=#  EXPLAIN verbose SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c1);
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Hash Join  (cost=43.36..104.40 rows=2149 distinct=[200, 200] width=4)
   Output: t1.c1
   Hash Cond: (t1.c1 = subquery."?column?")
   ->  Seq Scan on public.t1  (cost=0.00..31.49 rows=2149 width=4)
         Output: t1.c1, t1.c2
   ->  Hash  (cost=40.86..40.86 rows=200 width=8)
         Output: subquery."?column?", subquery.c1
         ->  Subquery Scan on subquery  (cost=36.86..40.86 rows=200 width=8)
               Output: subquery."?column?", subquery.c1
               ->  HashAggregate  (cost=36.86..38.86 rows=200 width=4)
                     Output: t2.c1, t2.c1
                     Group By Key: t2.c1
                     Filter: (t2.c1 IS NOT NULL)
                     Unique Check Required
                     ->  Seq Scan on public.t2  (cost=0.00..31.49 rows=2149 width=4)
                           Output: t2.c1
(16 rows)

Note: Because group by t2.c1 unique check occurs before the filter condition tt.c1=t1.c1, an error may be reported after the query that does not report an error is rewritten. An example is as follows:

There are tables t1 and t2. The data in the tables is as follows:

gaussdb=#  SELECT * FROM t1 ORDER BY c2;
 c1 | c2
----+----
  1 |  1
  2 |  2
  3 |  3
(3 rows)
gaussdb=#  SELECT * FROM t2 ORDER BY c2;
 c1 | c2
----+----
  1 |  1
  2 |  2
  3 |  3
  4 |  4
  4 |  4
  5 |  5
(6 rows)

Disable and enable the uniquecheck parameter for comparison. After the parameter is enabled, an error is reported.

gaussdb=#  SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2) ;
 c1
----
  1
  2
  3
(3 rows)
gaussdb=#  SET rewrite_rule='uniquecheck';
SET
gaussdb=#  SELECT t1.c1 FROM t1 WHERE t1.c1 = (SELECT t2.c1 FROM t2 WHERE t1.c1=t2.c2) ;
ERROR:  more than one row returned by a subquery used as an expression

Parameter lazyagg: Eliminating the Aggregation Operation in a Subquery

The aggregation operation in a subquery is eliminated to improve the query efficiency. An example is as follows:

gaussdb=# SET rewrite_rule =none;
SET
gaussdb=# EXPLAIN (costs off) SELECT t.c2, sum(cc) FROM (SELECT c2, sum(c3) AS cc FROM t1 GROUP BY c2) s1, t WHERE s1.c2=t.c2 GROUP BY t.c2 ORDER BY 1,2;
                     QUERY PLAN
-----------------------------------------------------
 Sort
   Sort Key: t.c2, (sum(s1.cc))
   ->  HashAggregate
         Group By Key: t.c2
         ->  Hash Join
               Hash Cond: (t.c2 = s1.c2)
               ->  Seq Scan on t
               ->  Hash
                     ->  Subquery Scan on s1
                           ->  HashAggregate
                                 Group By Key: t1.c2
                                 ->  Seq Scan on t1
(12 rows)

A subquery and the outer query have the same GROUP BY condition. The two-layer aggregation operation may lower down the query efficiency. Enable the lazyagg parameter to eliminate the aggregation operation in the subquery and improve the query performance.

gaussdb=# SET rewrite_rule = lazyagg;
SET
gaussdb=# EXPLAIN (costs off) SELECT t.c2, sum(cc) FROM (SELECT c2, sum(c3) AS cc FROM t1 GROUP BY c2) s1, t WHERE s1.c2=t.c2 GROUP BY t.c2 ORDER BY 1,2;
                QUERY PLAN
------------------------------------------
 Sort
   Sort Key: t.c2, (sum((t1.c3)::bigint))
   ->  HashAggregate
         Group By Key: t.c2
         ->  Hash Join
               Hash Cond: (t1.c2 = t.c2)
               ->  Seq Scan on t1
               ->  Hash
                     ->  Seq Scan on t
(9 rows)

Pushing Conditions from the Main Query to a Subquery

Join a subquery with the aggregation operator with the main query in advance. An example is as follows:

gaussdb=# SET rewrite_rule = none;
SET
gaussdb=# EXPLAIN (costs off) SELECT t1 FROM t1 WHERE t1.c2 = 10 AND t1.c3 < (SELECT sum(c3) FROM t2 WHERE t1.c1 = t2.c1);
              QUERY PLAN
---------------------------------------
 Hash Join
   Hash Cond: (t2.c1 = t1.c1)
   Join Filter: (t1.c3 < (sum(t2.c3)))
   ->  HashAggregate
         Group By Key: t2.c1
         ->  Seq Scan on t2
   ->  Hash
         ->  Seq Scan on t1
               Filter: (c2 = 10)
(9 rows)

In this case, the join columns of the subquery are grouped and aggregated, and then the subquery is joined with the main query. This reduces repeated scanning of related sublinks and improves query efficiency. After the rewriting parameters are modified, the plan is changed as follows:

gaussdb=# SET rewrite_rule = magicset;
SET
gaussdb=# explain (costs off) SELECT t1 FROM t1 WHERE t1.c2 = 10 AND t1.c3 < (SELECT sum(c3) FROM t2 WHERE t1.c1 = t2.c1);
                             QUERY PLAN
---------------------------------------------------------------------
 Hash Join
   Hash Cond: (t2.c1 = rewrite_rule_guc_test.t1.c1)
   Join Filter: (rewrite_rule_guc_test.t1.c3 < (sum(t2.c3)))
   ->  HashAggregate
         Group By Key: t2.c1
         ->  Hash Join
               Hash Cond: (t2.c1 = rewrite_rule_guc_test.t1.c1)
               ->  Seq Scan on t2
               ->  Hash
                     ->  HashAggregate
                           Group By Key: rewrite_rule_guc_test.t1.c1
                           ->  Seq Scan on t1
                                 Filter: (c2 = 10)
   ->  Hash
         ->  Seq Scan on t1
               Filter: (c2 = 10)
(16 rows)