SQL Self-Diagnosis

Background

A large number of queries are involved in services. Query problems at the planning and execution phases may appear, for example, whether the estimation is accurate, whether data skew occurs, and whether statistics are not collected and how to collect statistics. SQL self-diagnosis provides users with a more efficient and easy-to-use method for locating performance problems. It helps users simplify the SQL optimization process of batch processing jobs. After entering SQL statements, users can easily obtain SQL problems and optimization suggestions in batches, instead of executing complex optimization such as extracting and rewriting SQL statements in the job set one by one, reproducing the SQL statements that have performance problems, checking Explain Performance to locate the problem, rewriting the SQL statements, and adjusting parameters.

Function Description

Before executing a job, configure GUC parameters. For details, see "resource_track_level" and "resource_track_cost" in HUAWEI CLOUD Stack 8.0.2 GaussDB(DWS) Developer Guide. After running a user job, you can view the related system view to obtain the possible performance problems of the query job. These system views provide possible causes of performance problems. Based on the performance alarms, you can optimize the jobs with performance problems by referring to "SQL Self-Diagnosis" in HUAWEI CLOUD Stack 8.0.2 GaussDB(DWS) Developer Guide.

Technical Principles

Click to enlarge

[1] The CN queries and compiles SQL statements to generate a plan tree. The diagnosis analyzer diagnoses the plan tree to identify problems in query mode (including alarms that statistics are not collected and that SQL statements cannot be delivered).

[2] The DN executes the SQL statement and writes the statistics to the shared memory of the DN. If the current SQL is running in operator mode (all supported alarm scenarios can be diagnosed), collect operator execution statistics during SQL running.

[3] The DN executes the SQL statement and returns the execution result to the CN. If the operator mode is used, the DN returns the collected runtime status to the CN. The diagnosis analyzer analyzes the status and records the analysis result.

[4] Statistics in the memory are cleared every 3 minutes. To record all historical information, enable the GUC parameter enable_resource_record. The statistics in the memory is persisted to a specific system catalog every 3 minutes.

Benefits

Scenario Description	Benefits
Some column statistics are not collected.	Use ANALYZE to collect statistics and generate a better execution plan.
SQL statements are not pushed down.	Report the cause of the pushdown failure. Optimize SQL statements to push them down.
In a hash join, the larger table is used as the inner table.	A large volume of data is written to disks, which greatly affects the system performance. This problem can be avoided by optimizing the SQL statements.
Nestloop is used in a large-table equivalent join.	Use of Nestloop has a great impact on performance when the data volume is large. This problem can be avoided by optimizing the SQL statements.
A large table is broadcasted.	A large amount of data is transmitted on the network, which greatly affects the system performance. This problem can be avoided by optimizing the SQL statements.
Data skew	Some nodes become the system bottleneck.
Improper index	The index is improper. As a result, a large number of rows need to be scanned. In this case, modify indexes.
Inaccurate estimation	The estimated number of rows deviates greatly from the actual number. As a result, the selected plan is not optimal.