How Do I Troubleshoot Slow SQL Jobs?
If the job runs slowly, perform the following steps to find the causes and rectify the fault:
Possible Cause 1: Full GC
Check whether the problem is caused by FullGC.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
    Figure 1 Viewing logs  
- Obtain the folder of the archived logs in the OBS directory. The details are as follows:
- Go to the archive log file directory and download the gc.log.* log file.
- Open the downloaded gc.log.* log file, search for keyword Full GC, and check whether time records in the file are continuous and Full GC information is recorded repeatedly.
    Figure 5 Full GC logs  
Cause locating and solution
- Log in to the DLI console and go to the SQL editor page. On the SQL Editor page, select the queue and database of the faulty job.
- Run the following statement to check the number of files in the table and specify the table name.
     select count(distinct fn) FROM (select input_file_name() as fn from table name) a 
- If there are too many small files, rectify the fault by referring to How Do I Merge Small Files?.
Cause 2: There is a broadcast table.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click  to view the job details and obtain the job ID.
    Figure 6 Obtaining the job ID to view the job details and obtain the job ID.
    Figure 6 Obtaining the job ID  
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
    Figure 7 Clicking the job link  
- View the DAG of the job to check whether the BroadcastNestedLoopJoin node exists.
    Figure 8 DAG  
- If the BroadcastNestedLoopJoin node exists, refer to Why Does a SQL Job That Has Join Operations Stay in the Running State? to rectify the fault.
Possible Cause 2: Data Skew
Check whether the problem is caused by data skew.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click  to view the job details and obtain the job ID.
    Figure 9 Obtaining the job ID to view the job details and obtain the job ID.
    Figure 9 Obtaining the job ID  
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
     
- View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
     
- View the Launch Time and Duration of each task.
- Click Duration to sort tasks. Check whether the overall job duration is prolonged because a task has taken a long time.
    According to Figure 10, when data skew occurs, the data volume of shuffle reads of a task is much greater than that of other tasks.
Cause locating and solution
- Perform group by and count on a join to collect statistics on the number of key values of each join. The following is an example:
     Join table lefttbl and table righttbl. num in the lefttbl table is the key value of the join. You can perform group by and count on lefttbl.num. SELECT * FROM lefttbl a LEFT join righttbl b on a.num = b.int2; SELECT count(1) as count,num from lefttbl group by lefttbl.num ORDER BY count desc; Figure 11 shows the result. There are much more num parameters whose value are 1 than other values.
- Use concat(cast(round(rand() * 999999999) as string) to generate a random number for each key value.
- If the skew is serious and random numbers cannot be generated, see How Do I Do When Data Skew Occurs During the Execution of a SQL Job?
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot 
     
      



