Compute
Elastic Cloud Server
Huawei Cloud Flexus
Bare Metal Server
Auto Scaling
Image Management Service
Dedicated Host
FunctionGraph
Cloud Phone Host
Huawei Cloud EulerOS
Networking
Virtual Private Cloud
Elastic IP
Elastic Load Balance
NAT Gateway
Direct Connect
Virtual Private Network
VPC Endpoint
Cloud Connect
Enterprise Router
Enterprise Switch
Global Accelerator
Management & Governance
Cloud Eye
Identity and Access Management
Cloud Trace Service
Resource Formation Service
Tag Management Service
Log Tank Service
Config
OneAccess
Resource Access Manager
Simple Message Notification
Application Performance Management
Application Operations Management
Organizations
Optimization Advisor
IAM Identity Center
Cloud Operations Center
Resource Governance Center
Migration
Server Migration Service
Object Storage Migration Service
Cloud Data Migration
Migration Center
Cloud Ecosystem
KooGallery
Partner Center
User Support
My Account
Billing Center
Cost Center
Resource Center
Enterprise Management
Service Tickets
HUAWEI CLOUD (International) FAQs
ICP Filing
Support Plans
My Credentials
Customer Operation Capabilities
Partner Support Plans
Professional Services
Analytics
MapReduce Service
Data Lake Insight
CloudTable Service
Cloud Search Service
Data Lake Visualization
Data Ingestion Service
GaussDB(DWS)
DataArts Studio
Data Lake Factory
DataArts Lake Formation
IoT
IoT Device Access
Others
Product Pricing Details
System Permissions
Console Quick Start
Common FAQs
Instructions for Associating with a HUAWEI CLOUD Partner
Message Center
Security & Compliance
Security Technologies and Applications
Web Application Firewall
Host Security Service
Cloud Firewall
SecMaster
Anti-DDoS Service
Data Encryption Workshop
Database Security Service
Cloud Bastion Host
Data Security Center
Cloud Certificate Manager
Edge Security
Managed Threat Detection
Blockchain
Blockchain Service
Web3 Node Engine Service
Media Services
Media Processing Center
Video On Demand
Live
SparkRTC
MetaStudio
Storage
Object Storage Service
Elastic Volume Service
Cloud Backup and Recovery
Storage Disaster Recovery Service
Scalable File Service Turbo
Scalable File Service
Volume Backup Service
Cloud Server Backup Service
Data Express Service
Dedicated Distributed Storage Service
Containers
Cloud Container Engine
SoftWare Repository for Container
Application Service Mesh
Ubiquitous Cloud Native Service
Cloud Container Instance
Databases
Relational Database Service
Document Database Service
Data Admin Service
Data Replication Service
GeminiDB
GaussDB
Distributed Database Middleware
Database and Application Migration UGO
TaurusDB
Middleware
Distributed Cache Service
API Gateway
Distributed Message Service for Kafka
Distributed Message Service for RabbitMQ
Distributed Message Service for RocketMQ
Cloud Service Engine
Multi-Site High Availability Service
EventGrid
Dedicated Cloud
Dedicated Computing Cluster
Business Applications
Workspace
ROMA Connect
Message & SMS
Domain Name Service
Edge Data Center Management
Meeting
AI
Face Recognition Service
Graph Engine Service
Content Moderation
Image Recognition
Optical Character Recognition
ModelArts
ImageSearch
Conversational Bot Service
Speech Interaction Service
Huawei HiLens
Video Intelligent Analysis Service
Developer Tools
SDK Developer Guide
API Request Signing Guide
Terraform
Koo Command Line Interface
Content Delivery & Edge Computing
Content Delivery Network
Intelligent EdgeFabric
CloudPond
Intelligent EdgeCloud
Solutions
SAP Cloud
High Performance Computing
Developer Services
ServiceStage
CodeArts
CodeArts PerfTest
CodeArts Req
CodeArts Pipeline
CodeArts Build
CodeArts Deploy
CodeArts Artifact
CodeArts TestPlan
CodeArts Check
CodeArts Repo
Cloud Application Engine
MacroVerse aPaaS
KooMessage
KooPhone
KooDrive
Help Center/ GaussDB/ Developer Guide(Distributed_3.x)/ SQL Optimization/ Optimization Cases/ Case: Modifying the GUC Parameter rewrite_rule

Case: Modifying the GUC Parameter rewrite_rule

Updated on 2024-05-07 GMT+08:00

rewrite_rule contains multiple query rewriting rules: magicset, partialpush, uniquecheck, disablerep, intargetlist, and predpush. The following describes the application scenarios of some important rules.

Preparing the Case Environment

To demonstrate rule application scenarios, you need to prepare the following table creation statements:

-- Clean the environment.
DROP SCHEMA IF EXISTS rewrite_rule_guc_test CASCADE; 
CREATE SCHEMA rewrite_rule_guc_test;
SET current_schema=rewrite_rule_guc_test;
-- Create a test table.
CREATE TABLE t(c1 INT, c2 INT, c3 INT, c4 INT);
CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT);
CREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT);

partialpush: Partial Pushdown

Queries are pushed down to DNs for distributed execution, greatly accelerating queries. If a query statement contains a factor that cannot be pushed down, the entire statement cannot be pushed down. As a result, a stream plan cannot be generated and executed on DNs for the distributed execution, and the performance is poor.

The following is an example:

gaussdb=# set rewrite_rule='none'; 
SET
gaussdb=# explain (verbose on, costs off)  select group_concat(tt.c1, tt.c2) from (select t1.c1,t2.c2 from t1,t2 where t1.c1=t2.c2) tt(c1,c2);
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate
   Output: group_concat(t1.c1, t2.c2 SEPARATOR ',')
   ->  Hash Join
         Output: t1.c1, t2.c2
         Hash Cond: (t1.c1 = t2.c2)
         ->  Data Node Scan on t1 "_REMOTE_TABLE_QUERY_"
               Output: t1.c1
               Node/s: All datanodes
               Remote query: SELECT c1 FROM ONLY public.t1 WHERE true
         ->  Hash
               Output: t2.c2
               ->  Data Node Scan on t2 "_REMOTE_TABLE_QUERY_"
                     Output: t2.c2
                     Node/s: All datanodes
                     Remote query: SELECT c2 FROM ONLY public.t2 WHERE true

The group_concat() function cannot be pushed down. As a result, the remote query plan is executed:

  1. Deliver the select c1 from t1 where true statement to DNs to read all data in the t1 table.
  2. Deliver the select c2 from t2 where true statement to DNs to read all data in the t2 table.
  3. Perform HASH JOIN on the CN.
  4. Perform the group_concat calculation and return the final result.

This plan is slow because a large amount of data is transmitted over the network and then HASH JOIN is executed on the CN. As a result, cluster resources cannot be fully used.

partialpush is added to push the preceding 1, 2, and 3 operations down to DNs for distributed execution, greatly improving statement performance.

gaussdb=#  set rewrite_rule='partialpush'; 
SET
gaussdb=#  explain (verbose on, costs off) select two_sum(tt.c1, tt.c2) from (select t1.c1,t2.c2 from t1,t2 where t1.c1=t2.c2) tt(c1,c2);
                       QUERY PLAN
---------------------------------------------------------
 Subquery Scan on tt
   Output: two_sum(tt.c1, tt.c2)
   ->  Streaming (type: GATHER)  -- The Gather plan is executed on DNs in a distributed manner:
         Output: t1.c1, t2.c2
         Node/s: All datanodes
         ->  Nested Loop
               Output: t1.c1, t2.c2
               Join Filter: (t1.c1 = t2.c2)
               ->  Seq Scan on public.t1
                     Output: t1.c1, t1.c2, t1.c3
                     Distribute Key: t1.c1
               ->  Materialize
                     Output: t2.c2
                     ->  Streaming(type: REDISTRIBUTE)
                           Output: t2.c2
                           Distribute Key: t2.c2
                           Spawn on: All datanodes
                           Consumer Nodes: All datanodes
                           ->  Seq Scan on public.t2
                                 Output: t2.c2
                                 Distribute Key: t2.c1
(21 rows)

intargetlist: Target Column Subquery Performance Improvement

The query performance can be greatly improved by converting the subquery in the target column to JOIN. The following is an example:

gaussdb=#  set rewrite_rule='none'; 
SET
gaussdb=#  explain (verbose on, costs off) select c1,(select avg(c2) from t2 where t2.c2=t1.c2) from t1 where t1.c1<100 order by t1.c2;
                              QUERY PLAN
-----------------------------------------------------------------------
 Streaming (type: GATHER)
   Output: t1.c1, ((SubPlan 1)), t1.c2
   Merge Sort Key: t1.c2
   Node/s: All datanodes
   ->  Sort
         Output: t1.c1, ((SubPlan 1)), t1.c2
         Sort Key: t1.c2
         ->  Seq Scan on public.t1
               Output: t1.c1, (SubPlan 1), t1.c2
               Distribute Key: t1.c1
               Filter: (t1.c1 < 100)
               SubPlan 1
                 ->  Aggregate
                       Output: avg(t2.c2)
                       ->  Result
                             Output: t2.c2
                             Filter: (t2.c2 = t1.c2)
                             ->  Materialize
                                   Output: t2.c2
                                   ->  Streaming(type: BROADCAST)
                                         Output: t2.c2
                                         Spawn on: All datanodes
                                         Consumer Nodes: All datanodes
                                         ->  Seq Scan on public.t2
                                               Output: t2.c2
                                               Distribute Key: t2.c1
(26 rows)

Because the subquery (select avg(c2) from t2 where t2.c2=t1.c2) in the target column cannot be pulled up, execution of the subquery is triggered each time a row of data of t1 is scanned, and the query efficiency is low. If the intargetlist parameter is enabled, the subquery is converted to JOIN to improve the query performance.

gaussdb=#  set rewrite_rule='intargetlist';
SET
gaussdb=#  explain (verbose on, costs off) select c1,(select avg(c2) from t2 where t2.c2=t1.c2) from t1 where t1.c1<100 order by t1.c2;
                          QUERY PLAN
---------------------------------------------------------------
 Streaming (type: GATHER)
   Output: t1.c1, (avg(t2.c2)), t1.c2
   Merge Sort Key: t1.c2
   Node/s: All datanodes
   ->  Sort
         Output: t1.c1, (avg(t2.c2)), t1.c2
         Sort Key: t1.c2
         ->  Hash Right Join
               Output: t1.c1, (avg(t2.c2)), t1.c2
               Hash Cond: (t2.c2 = t1.c2)
               ->  Streaming(type: BROADCAST)
                     Output: (avg(t2.c2)), t2.c2
                     Spawn on: All datanodes
                     Consumer Nodes: All datanodes
                     ->  HashAggregate
                           Output: avg(t2.c2), t2.c2
                           Group By Key: t2.c2
                           ->  Streaming(type: REDISTRIBUTE)
                                 Output: t2.c2
                                 Distribute Key: t2.c2
                                 Spawn on: All datanodes
                                 Consumer Nodes: All datanodes
                                 ->  Seq Scan on public.t2
                                       Output: t2.c2
                                       Distribute Key: t2.c1
               ->  Hash
                     Output: t1.c1, t1.c2
                     ->  Seq Scan on public.t1
                           Output: t1.c1, t1.c2
                           Distribute Key: t1.c1
                           Filter: (t1.c1 < 100)
(31 rows)

uniquecheck: Performance Improvement of Subqueries Without Aggregate Functions

Ensure that each condition has only one line of output. The subqueries with aggregate functions can be automatically pulled up. For subqueries without aggregate functions, the following is an example:

select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ;

Rewrite as follows:

select t1.c1 from t1 join (select t2.c1 from t2 where t2.c1 is not null group by t2.c1(unique check)) tt(c1) on tt.c1=t1.c1;

Note that unique check in the preceding SQL statement indicates that t2.c1 needs to be checked. If the SQL statement is abnormal, the SQL statement cannot be directly executed. To ensure semantic equivalence, the subquery tt must ensure that each group by t2.c1 has only one line of output. Enable the uniquecheck query rewriting parameter to ensure that the query can be pulled up and equivalent. If more than one row of data is output at run time, an error is reported.

gaussdb=#  set rewrite_rule='uniquecheck';
SET
gaussdb=#  explain verbose select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c1) ;
                               QUERY PLAN
------------------------------------------------------------------------
 Streaming (type: GATHER)
   Output: t1.c1
   Node/s: All datanodes
   ->  Nested Loop
         Output: t1.c1
         Join Filter: (t1.c1 = subquery."?column?")
         ->  Seq Scan on public.t1
               Output: t1.c1, t1.c2, t1.c3
               Distribute Key: t1.c1
         ->  Materialize
               Output: subquery."?column?", subquery.c1
               ->  Subquery Scan on subquery
                     Output: subquery."?column?", subquery.c1
                     ->  HashAggregate
                           Output: t2.c1, t2.c1
                           Group By Key: t2.c1
                           Filter: (t2.c1 IS NOT NULL)
                           Unique Check Required -- If more than one row of data is output during running, an error is reported.
                           ->  Index Only Scan using t2idx on public.t2
                                 Output: t2.c1
                                 Distribute Key: t2.c1
(21 rows)

Note: Because group by t2.c1 unique check occurs before the filter condition tt.c1=t1.c1, an error may be reported after the query that does not report an error is rewritten. An example is as follows:

There are tables t1 and t2. The data in the tables is as follows:

gaussdb=#  select * from t1 order by c2;
 c1 | c2 | c3
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
  4 |  4 |  4
  5 |  5 |  5
  6 |  6 |  6
  7 |  7 |  7
  8 |  8 |  8
  9 |  9 |  9
 10 | 10 | 10
(10 rows)

gaussdb=#  select * from t2 order by c1;
 c1 | c2 | c3
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
  4 |  4 |  4
  5 |  5 |  5
  6 |  6 |  6
  7 |  7 |  7
  8 |  8 |  8
  9 |  9 |  9
 10 | 10 | 10
 11 | 11 | 11
 11 | 11 | 11
 12 | 12 | 12
 12 | 12 | 12
 13 | 13 | 13
 13 | 13 | 13
 14 | 14 | 14
 14 | 14 | 14
 15 | 15 | 15
 15 | 15 | 15
 16 | 16 | 16
 16 | 16 | 16
 17 | 17 | 17
 17 | 17 | 17
 18 | 18 | 18
 18 | 18 | 18
 19 | 19 | 19
 19 | 19 | 19
 20 | 20 | 20
 20 | 20 | 20
(30 rows)

Disable and enable the uniquecheck parameter for comparison. After the parameter is enabled, an error is reported.

gaussdb=#   select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ;
 c1
----
  6
  7
  3
  1
  2
  4
  5
  8
  9
 10
(10 rows)

gaussdb=#  set rewrite_rule='uniquecheck';
SET
gaussdb=#   select t1.c1 from t1 where t1.c1 = (select t2.c1 from t2 where t1.c1=t2.c2) ;
ERROR:  more than one row returned by a subquery used as an expression

predpush, predpushnormal, and predpushforce: Condition Pushdown to Subqueries

Generally, the optimizer performs optimization by query block, and different query blocks are independently optimized. If a predicate condition involving cross-query blocks exists, it is difficult to consider the location of a predicate application from a global perspective. The predpush may push down the predicate to the subquery block, so that performance can be improved in a scenario in which the data volume in the parent query block is relatively small and an index can be used in the subquery. There are three rewriting rules related to predpush:

  • predpushnormal: attempts to push predicates down to subqueries. The STREAM operators, such as BROADCAST, are used to implement distributed plans.
  • predpushforce: attempts to push down predicates to subqueries and uses the index of the parameterized path for scanning as much as possible.
  • predpush: selects an optimal distributed plan from predpushnormal and predpushforce at a cost, but increases optimization time.

The following is an example of a plan for disabling and enabling the query rewriting rule:

gaussdb=# set enable_fast_query_shipping = off; -- Disable FQS optimization.
SET
gaussdb=# show rewrite_rule;
 rewrite_rule
--------------
 magicset
(1 row)

gaussdb=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1;
              QUERY PLAN
--------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Nested Loop
         Join Filter: (t1.c1 = t2.c1)
         ->  HashAggregate
               Group By Key: t2.c1
               ->  Seq Scan on t2
         ->  Seq Scan on t1
(8 rows)


gaussdb=# set rewrite_rule='predpushnormal';
SET
gaussdb=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1;
                 QUERY PLAN
---------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Nested Loop
         ->  Seq Scan on t1
         ->  HashAggregate
               Group By Key: t2.c1
               ->  Result
                     Filter: (t1.c1 = t2.c1)
                     ->  Seq Scan on t2
(9 rows)

-- You can see that the filter criteria are pushed to the subquery for execution.


gaussdb=# set rewrite_rule='predpushforce';
SET

gaussdb=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1;
                     QUERY PLAN
----------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Nested Loop
         ->  Seq Scan on t1
         ->  HashAggregate
               Group By Key: t2.c1
               ->  Index Scan using t2_c1_idx on t2
                     Index Cond: (t1.c1 = c1)
(8 rows)

-- When used together with predpush hints, you can see that parameterized paths are used.

gaussdb=# set rewrite_rule = 'predpush';
SET
gaussdb=# explain (costs off) select * from t1, (select sum(c2), c1 from t2 group by c1) st2 where st2.c1 = t1.c1;
                     QUERY PLAN
----------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Nested Loop
         ->  Seq Scan on t1
         ->  HashAggregate
               Group By Key: t2.c1
               ->  Result
                     Filter: (t1.c1 = t2.c1)
                     ->  Seq Scan on t2
(9 rows)

Forbidding Pullup of Subquery Parameter disablerep for Replication Tables

When querying a replication table, the query actually takes effect on a DN only. Pulling up the subquery parameter disablerep may deteriorate the performance. The following is an example:

gaussdb=# create table t_rep(a int) distribute by replication;
CREATE TABLE
gaussdb=# create table t_dis(a int);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'a' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE TABLE
gaussdb=# set rewrite_rule = '';
SET
gaussdb=# explain (costs off) select * from t_dis where a = any(select a from t_rep) or a > 100;
                          QUERY PLAN
---------------------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Hash Left Join
         Hash Cond: (t_dis.a = subquery.a)
         Filter: ((subquery.a IS NOT NULL) OR (t_dis.a > 100))
         ->  Seq Scan on t_dis
         ->  Hash
               ->  Subquery Scan on subquery
                     Filter: (Hash By subquery.a)
                     ->  HashAggregate
                           Group By Key: t_rep.a
                           ->  Seq Scan on t_rep
(12 rows)

For a replication table, the data stored on all DNs is the same. Therefore, you do not need to scan the replication table on all nodes.

gaussdb=# set rewrite_rule = disablerep;
SET
gaussdb=# explain (costs off) select * from t_dis where a = any(select a from t_rep) or a > 100;
                    QUERY PLAN
---------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Seq Scan on t_dis
         Filter: ((hashed SubPlan 1) OR (a > 100))
         SubPlan 1
           ->  Seq Scan on t_rep
(6 rows

We use cookies to improve our site and your experience. By continuing to browse our site you accept our cookie policy. Find out more

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback