How Do I Troubleshoot Slow SQL Jobs?
If the job runs slowly, perform the following steps to find the causes and rectify the fault:
Possible Cause 1: Full GC
Check whether the problem is caused by FullGC.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
Figure 1 Viewing logs
- Obtain the folder of the archived logs in the OBS directory. The details are as follows:
- Go to the archive log file directory and download the gc.log.* log file.
- Open the downloaded gc.log.* log file, search for keyword Full GC, and check whether time records in the file are continuous and Full GC information is recorded repeatedly.
Figure 5 Full GC logs
Cause locating and solution
- Log in to the DLI console and go to the SQL editor page. On the SQL Editor page, select the queue and database of the faulty job.
- Run the following statement to check the number of files in the table and specify the table name.
select count(distinct fn) FROM (select input_file_name() as fn from table name) a
- If there are too many small files, rectify the fault by referring to How Do I Merge Small Files?.
Cause 2: There is a broadcast table.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click to view the job details and obtain the job ID.
Figure 6 Obtaining the job ID
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
Figure 7 Clicking the job link
- View the DAG of the job to check whether the BroadcastNestedLoopJoin node exists.
Figure 8 DAG
- If the BroadcastNestedLoopJoin node exists, refer to Why Does a SQL Job That Has Join Operations Stay in the Running State? to rectify the fault.
Possible Cause 2: Data Skew
Check whether the problem is caused by data skew.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click to view the job details and obtain the job ID.
Figure 9 Obtaining the job ID
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
- View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
- View the Launch Time and Duration of each task.
- Click Duration to sort tasks. Check whether the overall job duration is prolonged because a task has taken a long time.
According to Figure 10, when data skew occurs, the data volume of shuffle reads of a task is much greater than that of other tasks.
Cause locating and solution
- Perform group by and count on a join to collect statistics on the number of key values of each join. The following is an example:
Join table lefttbl and table righttbl. num in the lefttbl table is the key value of the join. You can perform group by and count on lefttbl.num.
SELECT * FROM lefttbl a LEFT join righttbl b on a.num = b.int2; SELECT count(1) as count,num from lefttbl group by lefttbl.num ORDER BY count desc;
Figure 11 shows the result. There are much more num parameters whose value are 1 than other values. - Use concat(cast(round(rand() * 999999999) as string) to generate a random number for each key value.
- If the skew is serious and random numbers cannot be generated, see How Do I Eliminate Data Skew by Configuring AE Parameters?
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.