How Do I Troubleshoot Slow SQL Jobs?

If the job runs slowly, perform the following steps to find the causes and rectify the fault:

Check whether the problem is caused by FullGC.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
Figure 1 Viewing logs
Obtain the folder of the archived logs in the OBS directory. The details are as follows:
- Spark SQL jobs:
  Locate the log folder whose name contains driver or container_ xxx _000001.
  Figure 2 Example folder name
  
  Figure 3 Example folder name
- Spark Jar jobs:
  The archive log folder of a Spark Jar job starts with batch.
  Figure 4 Example folder name
Go to the archive log file directory and download the gc.log.* log file.
Open the downloaded gc.log.* log file, search for keyword Full GC, and check whether time records in the file are continuous and Full GC information is recorded repeatedly.
Figure 5 Full GC logs

Cause locating and solution

Cause 1: There are too many small files in a table.

Log in to the DLI console and go to the SQL editor page. On the SQL Editor page, select the queue and database of the faulty job.
Run the following statement to check the number of files in the table and specify the table name.
```
select count(distinct fn)  FROM
(select input_file_name() as fn from table name) a
```
If there are too many small files, rectify the fault by referring to How Do I Merge Small Files?.

Cause 2: There is a broadcast table.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job and click to view the job details and obtain the job ID.
Figure 6 Obtaining the job ID
In the Operation column of the job, click Spark UI.
On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
Figure 7 Clicking the job link
View the DAG of the job to check whether the BroadcastNestedLoopJoin node exists.
Figure 8 DAG
If the BroadcastNestedLoopJoin node exists, refer to Why Does a SQL Job That Has Join Operations Stay in the Running State? to rectify the fault.

Check whether the problem is caused by data skew.

Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
On the SQL Jobs page, locate the row that contains the target job and click to view the job details and obtain the job ID.
Figure 9 Obtaining the job ID
In the Operation column of the job, click Spark UI.
On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
View the Launch Time and Duration of each task.
Click Duration to sort tasks. Check whether the overall job duration is prolonged because a task has taken a long time.
According to Figure 10, when data skew occurs, the data volume of shuffle reads of a task is much greater than that of other tasks.
Figure 10 Data skew

Cause locating and solution

Shuffle data skew is caused by unbalanced number of key values in join.

Perform group by and count on a join to collect statistics on the number of key values of each join. The following is an example:
Join table lefttbl and table righttbl. num in the lefttbl table is the key value of the join. You can perform group by and count on lefttbl.num.
```
SELECT * FROM lefttbl a LEFT join righttbl b on a.num = b.int2;
SELECT count(1) as count,num from lefttbl  group by lefttbl.num ORDER BY count desc;
```
Figure 11 shows the result. There are much more num parameters whose value are 1 than other values.
Figure 11 Statistics on num values
Use concat(cast(round(rand() * 999999999) as string) to generate a random number for each key value.
If the skew is serious and random numbers cannot be generated, see How Do I Eliminate Data Skew by Configuring AE Parameters?

Parent topic: O&M Guide

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.

Which of the following issues have you encountered?

Content is inconsistent with the product UI

Unclear descriptions

Lack of examples or code

Incorrect steps

Can't find what I need

Lack of best practices

Feedback (optional)

0/500

Select at least one type of issue, and enter your comments or suggestions.

Enter a maximum of 500 characters.

Submit Cancel