Case: Modifying the GUC Parameter rewrite_rule
rewrite_rule contains multiple query rewriting rules: magicset, partialpush, uniquecheck, disablerep, intargetlist, and predpush. The following describes the application scenarios of some important rules:
partialpush: Partial Pushdown
Queries are pushed down to DNs for distributed execution, greatly accelerating queries. If a query statement contains a factor that cannot be pushed down, the entire statement cannot be pushed down. As a result, a stream plan cannot be generated and executed on DNs for the distributed execution, and the performance is poor.
The following is an example:
yshen=# set rewrite_rule='none'; SET yshen=# explain (verbose on, costs off) select two_sum(tt.c1, tt.c2) from (select t1.c1,t2.c2 from t1,t2 where t1.c1=t2.c2) tt(c1,c2); QUERY PLAN ---------------------------------------------------------------------- Hash Join Output: two_sum(t1.c1, t2.c2) Hash Cond: (t1.c1 = t2.c2) -> Data Node Scan on t1 "_REMOTE_TABLE_QUERY_" Output: t1.c1 Node/s: All datanodes Remote query: SELECT c1 FROM ONLY public.t1 WHERE true -> Hash Output: t2.c2 -> Data Node Scan on t2 "_REMOTE_TABLE_QUERY_" Output: t2.c2 Node/s: All datanodes Remote query: SELECT c2 FROM ONLY public.t2 WHERE true (13 rows)
The two_sum() function cannot be pushed down. As a result, the remote query plan is executed:
- Deliver the select c1 from t1 where true statement to DNs to read all data in the t1 table.
- Deliver the select c2 from t2 where true statement to DNs to read all data in the t2 table.
- Perform HASH JOIN on the CN.
- Perform the two_sum calculation and return the final result.
This plan is slow because a large amount of data is transmitted over the network and then HASH JOIN is executed on the CN. As a result, cluster resources cannot be fully used.
partialpush is added to push the preceding 1, 2, and 3 operations down to DNs for distributed execution, greatly improving statement performance.
yshen=# set rewrite_rule='partialpush'; SET yshen=# explain (verbose on, costs off) select two_sum(tt.c1, tt.c2) from (select t1.c1,t2.c2 from t1,t2 where t1.c1=t2.c2) tt(c1,c2); QUERY PLAN --------------------------------------------------------- Subquery Scan on tt Output: two_sum(tt.c1, tt.c2) -> Streaming (type: GATHER) --The Gather plan is executed on DNs in a distributed manner: Output: t1.c1, t2.c2 Node/s: All datanodes -> Nested Loop Output: t1.c1, t2.c2 Join Filter: (t1.c1 = t2.c2) -> Seq Scan on public.t1 Output: t1.c1, t1.c2, t1.c3 Distribute Key: t1.c1 -> Materialize Output: t2.c2 -> Streaming(type: REDISTRIBUTE) Output: t2.c2 Distribute Key: t2.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Seq Scan on public.t2 Output: t2.c2 Distribute Key: t2.c1 (21 rows)
intargetlist: Target Column Subquery Performance Improvement
The query performance can be greatly improved by converting the subquery in the target column to JOIN. The following is an example:
yshen=# set rewrite_rule='none'; SET yshen=# explain (verbose on, costs off) select c1,(select avg(c2) from t2 where t2.c2=t1.c2) from t1 where t1.c1<100 order by t1.c2; QUERY PLAN ----------------------------------------------------------------------- Streaming (type: GATHER) Output: t1.c1, ((SubPlan 1)), t1.c2 Merge Sort Key: t1.c2 Node/s: All datanodes -> Sort Output: t1.c1, ((SubPlan 1)), t1.c2 Sort Key: t1.c2 -> Seq Scan on public.t1 Output: t1.c1, (SubPlan 1), t1.c2 Distribute Key: t1.c1 Filter: (t1.c1 < 100) SubPlan 1 -> Aggregate Output: avg(t2.c2) -> Result Output: t2.c2 Filter: (t2.c2 = t1.c2) -> Materialize Output: t2.c2 -> Streaming(type: BROADCAST) Output: t2.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Seq Scan on public.t2 Output: t2.c2 Distribute Key: t2.c1 (26 rows)
Because the subquery (select avg(c2) from t2 where t2.c2=t1.c2) in the target column cannot be pulled up, execution of the subquery is triggered each time a row of data of t1 is scanned, and the query efficiency is low. If the intargetlist parameter is enabled, the subquery is converted to JOIN to improve the query performance.
yshen=# set rewrite_rule='intargetlist'; SET yshen=# explain (verbose on, costs off) select c1,(select avg(c2) from t2 where t2.c2=t1.c2) from t1 where t1.c1<100 order by t1.c2; QUERY PLAN --------------------------------------------------------------- Streaming (type: GATHER) Output: t1.c1, (avg(t2.c2)), t1.c2 Merge Sort Key: t1.c2 Node/s: All datanodes -> Sort Output: t1.c1, (avg(t2.c2)), t1.c2 Sort Key: t1.c2 -> Hash Right Join Output: t1.c1, (avg(t2.c2)), t1.c2 Hash Cond: (t2.c2 = t1.c2) -> Streaming(type: BROADCAST) Output: (avg(t2.c2)), t2.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> HashAggregate Output: avg(t2.c2), t2.c2 Group By Key: t2.c2 -> Streaming(type: REDISTRIBUTE) Output: t2.c2 Distribute Key: t2.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Seq Scan on public.t2 Output: t2.c2 Distribute Key: t2.c1 -> Hash Output: t1.c1, t1.c2 -> Seq Scan on public.t1 Output: t1.c1, t1.c2 Distribute Key: t1.c1 Filter: (t1.c1 < 100) (31 rows)
uniquecheck: Performance Improvement of Subqueries Without Aggregate Functions
Ensure that each condition has only one line of output. The subqueries with aggregate functions can be automatically pulled up. For subqueries without aggregate functions, the following is an example:
select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ;
Rewrite as follows:
select t1.c1 from t1 join (select t2.c1 from t2 where t2.c1 is not null group by t2.c1(unique check)) tt(c1) on tt.c1=t1.c1;
To ensure semantic equivalence, the subquery tt must ensure that each group by t2.c1 has only one line of output. Enable the uniquecheck query rewriting parameter to ensure that the query can be pulled up and equivalent. If more than one row of data is output at run time, an error is reported.
yshen=# set rewrite_rule='uniquecheck'; SET yshen=# explain verbose select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c1) ; QUERY PLAN ------------------------------------------------------------------------ Streaming (type: GATHER) Output: t1.c1 Node/s: All datanodes -> Nested Loop Output: t1.c1 Join Filter: (t1.c1 = subquery."?column?") -> Seq Scan on public.t1 Output: t1.c1, t1.c2, t1.c3 Distribute Key: t1.c1 -> Materialize Output: subquery."?column?", subquery.c1 -> Subquery Scan on subquery Output: subquery."?column?", subquery.c1 -> HashAggregate Output: t2.c1, t2.c1 Group By Key: t2.c1 Filter: (t2.c1 IS NOT NULL) Unique Check Required -- If more than one row of data is output during running, an error is reported. -> Index Only Scan using t2idx on public.t2 Output: t2.c1 Distribute Key: t2.c1 (21 rows)
Note: Because group by t2.c1 unique check occurs before the filter condition tt.c1=t1.c1, an error may be reported after the query that does not report an error is rewritten. An example is as follows:
There are tables t1 and t2. The data in the tables is as follows:
yshen=# select * from t1 order by c2; c1 | c2 | c3 ----+----+---- 1 | 1 | 1 2 | 2 | 2 3 | 3 | 3 4 | 4 | 4 5 | 5 | 5 6 | 6 | 6 7 | 7 | 7 8 | 8 | 8 9 | 9 | 9 10 | 10 | 10 (10 rows) yshen=# select * from t2 order by c1; c1 | c2 | c3 ----+----+---- 1 | 1 | 1 2 | 2 | 2 3 | 3 | 3 4 | 4 | 4 5 | 5 | 5 6 | 6 | 6 7 | 7 | 7 8 | 8 | 8 9 | 9 | 9 10 | 10 | 10 11 | 11 | 11 11 | 11 | 11 12 | 12 | 12 12 | 12 | 12 13 | 13 | 13 13 | 13 | 13 14 | 14 | 14 14 | 14 | 14 15 | 15 | 15 15 | 15 | 15 16 | 16 | 16 16 | 16 | 16 17 | 17 | 17 17 | 17 | 17 18 | 18 | 18 18 | 18 | 18 19 | 19 | 19 19 | 19 | 19 20 | 20 | 20 20 | 20 | 20 (30 rows)
Disable and enable the uniquecheck parameter for comparison. After the parameter is enabled, an error is reported.
yshen=# select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ; c1 ---- 6 7 3 1 2 4 5 8 9 10 (10 rows) yshen=# set rewrite_rule='uniquecheck'; SET yshen=# select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ; ERROR: more than one row returned by a subquery used as an expression
predpush, predpushnormal, and predpushforce: Condition Pushdown to Subqueries
Generally, the optimizer performs optimization by query block, and different query blocks are independently optimized. If a predicate condition involving cross-query blocks exists, it is difficult to consider the location of a predicate application from a global perspective. The predpush may push down the predicate to the subquery block, so that performance can be improved in a scenario in which the data volume in the parent query block is relatively small and an index can be used in the subquery. There are three rewriting rules related to predpush:
- predpushnormal: attempts to push predicates down to subqueries. The STREAM operators, such as BROADCAST, are used to implement distributed plans.
- predpushforce: attempts to push down predicates to subqueries and uses the index of the parameterized path for scanning as much as possible.
- predpush: selects an optimal distributed plan from predpushnormal and predpushforce at a cost, but increases optimization time.
The following is an example of a plan for disabling and enabling the query rewriting rule:
openGauss=# show rewrite_rule; rewrite_rule -------------- magicset (1 row) openGauss=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1; QUERY PLAN -------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Nested Loop Join Filter: (t1.c1 = t2.c1) -> HashAggregate Group By Key: t2.c1 -> Seq Scan on t2 -> Seq Scan on t1 (8 rows) openGauss=# set rewrite_rule='predpushnormal'; SET openGauss=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1; QUERY PLAN ---------------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Nested Loop -> Seq Scan on t1 -> GroupAggregate Group By Key: t2.c1 -> Result Filter: (t1.c1 = t2.c1) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on t2 (12 rows) openGauss=# set rewrite_rule='predpushforce'; SET openGauss=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1; QUERY PLAN ---------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Nested Loop -> Seq Scan on t1 -> HashAggregate Group By Key: t2.c1 -> Index Scan using t2_c1_idx on t2 Index Cond: (t1.c1 = c1) (8 rows) openGauss=# set rewrite_rule = 'predpush'; SET openGauss=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1; QUERY PLAN ---------------------------------------------------------- Streaming (type: GATHER) Node/s: All datanodes -> Nested Loop -> Seq Scan on t1 -> GroupAggregate Group By Key: t2.c1 -> Result Filter: (t1.c1 = t2.c1) -> Materialize -> Streaming(type: BROADCAST) Spawn on: All datanodes -> Seq Scan on t2 (12 rows)
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot