Updated on 2023-03-30 GMT+08:00

SQL Self-Diagnosis

Background

Services typically involve a large number of queries, and performance problems can arise in both the planning and execution phases, such as inaccurate cardinality estimation, data skew, and missing statistics. SQL self-diagnosis provides a more efficient and easier way to locate such performance problems and simplifies SQL optimization for batch-processing jobs. After submitting SQL statements, users can obtain detected problems and optimization suggestions in batches, instead of performing complex tuning on each statement: extracting and rewriting SQL statements in the job set one by one, reproducing the statements that have performance problems, checking EXPLAIN PERFORMANCE output to locate the problem, rewriting the statements, and adjusting parameters.

Function

Before executing a job, configure the GUC parameters. For details, see sections "resource_track_level" and "resource_track_cost" in the Developer Guide. After the job runs, you can query the related system views to see possible performance problems of the queries, together with their probable causes. Based on these causes, you can optimize job performance by referring to section "SQL Self-Diagnosis" in the Developer Guide.
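As a sketch of this setup: the parameter names come from this document, but the values shown and the view and column names (GS_WLM_SESSION_STATISTICS, warning) are assumptions that may differ across product versions.

```sql
-- Sketch only; the settings below are illustrative values, and the view
-- and column names are assumptions that may differ by version.
SET resource_track_level = 'operator';  -- 'query' limits diagnosis to plan-time problems
SET resource_track_cost = 100000;       -- track only statements whose estimated cost exceeds this

-- Run the workload, then inspect the diagnosis output, e.g. a warning
-- column in a workload-management view:
SELECT query, warning
FROM gs_wlm_session_statistics
WHERE warning IS NOT NULL;
```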

Technical Principles

[1] The CN parses and compiles the SQL statement to generate a plan tree. The diagnosis analyzer examines the plan tree to identify problems detectable in query mode (for example, alarms indicating that statistics are not collected or that the SQL statement cannot be pushed down).

[2] DNs execute the SQL statement and write runtime statistics to the shared memory of each DN. If the statement is running in operator mode (in which all supported alarm scenarios can be diagnosed), operator-level execution statistics are collected while the statement runs.

[3] DNs return the execution result to the CN. In operator mode, each DN also returns the collected runtime statistics to the CN; the diagnosis analyzer analyzes them and records the analysis result.

[4] Statistics in the shared memory are cleared every 3 minutes. To retain all historical information, enable the GUC parameter enable_resource_record; the in-memory statistics are then persisted to a specific system catalog every 3 minutes.
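Persistence could be enabled as sketched below. The parameter enable_resource_record is named in this document; the catalog name gs_wlm_session_info and its columns are assumptions, so check the documentation for your version.

```sql
-- Sketch: enable persistence of in-memory statistics (setting this may
-- require the appropriate privilege; parameter scope varies by version).
SET enable_resource_record = on;

-- After the 3-minute flush, historical records could be queried from a
-- system catalog such as gs_wlm_session_info (name is an assumption):
SELECT query, warning
FROM gs_wlm_session_info
WHERE warning IS NOT NULL;
```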

Optimization

Scenario: Statistics of a single column or multiple columns are not collected.
Optimization: Run the ANALYZE command to collect statistics so that a better execution plan can be generated.

Scenario: SQL statements are not pushed down.
Optimization: The reason why pushdown fails is reported. Rewrite the SQL statements so that they can be pushed down.

Scenario: In a hash join, the larger table is used as the inner table.
Optimization: A large amount of data is spilled to disk, which severely affects performance. Rewrite the SQL statements to avoid this situation.

Scenario: A nested loop is used in an equi-join between large tables.
Optimization: A nested loop severely affects performance when the data volume is large. Rewrite the SQL statements to avoid this situation.

Scenario: A large table is broadcast.
Optimization: A large amount of data is transmitted over the network, which severely affects performance. Rewrite the SQL statements to avoid this situation.

Scenario: Data skew occurs.
Optimization: Some nodes become the system bottleneck.

Scenario: The index is improper.
Optimization: Too many rows are scanned. Modify the index.

Scenario: Estimation is inaccurate.
Optimization: The difference between the estimated and actual numbers of rows is large; as a result, the selected plan is not the optimal one.
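For the first scenario above, statistics can be collected with ANALYZE; the table and column names below are purely illustrative.

```sql
-- Collect table-level statistics (hypothetical table name "sales").
ANALYZE sales;

-- Collect statistics for specific columns only; the exact syntax for
-- multi-column (combined) statistics may differ by version.
ANALYZE sales (region, channel);
```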