How Do I Troubleshoot Slow SQL Jobs?
If the job runs slowly, perform the following steps to find the causes and rectify the fault:
Possible Cause 1: Full GC
Check whether the problem is caused by full garbage collection (Full GC).
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job and click More > View Log in the Operation column.
Figure 1 Viewing logs
- Obtain the folder of the archived logs in the OBS directory. The details are as follows:
- Go to the archive log file directory and download the gc.log.* log file.
- Open the downloaded gc.log.* log file, search for the keyword Full GC, and check whether the timestamps in the file are continuous and whether Full GC events are recorded repeatedly.
Figure 5 Full GC logs
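The log check above can be sketched as a small script. This is an illustration only: the helper name count_full_gc and the sample log lines are assumptions, since real gc.log contents depend on the JVM version and GC flags.

```python
import re

def count_full_gc(log_text: str) -> int:
    """Count 'Full GC' events in a HotSpot-style gc.log excerpt."""
    return len(re.findall(r"Full GC", log_text))

# Illustrative gc.log excerpt (the exact format is an assumption; real files vary).
sample = """\
2024-05-01T12:00:01.123+0000: 10.5: [GC (Allocation Failure) 512M->300M(1024M), 0.05 secs]
2024-05-01T12:00:02.456+0000: 11.8: [Full GC (Ergonomics) 900M->880M(1024M), 2.31 secs]
2024-05-01T12:00:05.789+0000: 15.1: [Full GC (Ergonomics) 910M->905M(1024M), 2.44 secs]
"""

full_gc_events = count_full_gc(sample)
print(full_gc_events)  # 2
# Many closely spaced Full GC lines indicate memory pressure on the executors.
```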
Cause locating and solution
Cause 1: There are too many small files.
- Log in to the DLI console and go to the SQL Editor page. Select the queue and database of the faulty job.
- Run the following statement to check the number of files in the table. Replace table_name with the actual table name.
SELECT count(DISTINCT fn) FROM (SELECT input_file_name() AS fn FROM table_name) a;
- If there are too many small files, rectify the fault by referring to How Do I Merge Small Files?.
Cause 2: There is a broadcast table.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job, view the job details, and obtain the job ID.
Figure 6 Obtaining the job ID
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
Figure 7 Clicking the job link
- View the DAG of the job to check whether the BroadcastNestedLoopJoin node exists.
Figure 8 DAG
- If the BroadcastNestedLoopJoin node exists, refer to Why Does a SQL Job That Has Join Operations Stay in the Running State? to rectify the fault.
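Why a BroadcastNestedLoopJoin node signals trouble can be sketched in plain Python (a toy model, not Spark itself): with no equality condition, every left row must be compared with every right row, while an equi-join such as a.num = b.int2 can use a hash join that touches each row roughly once. The table and column names reuse the lefttbl/righttbl example from this FAQ.

```python
from collections import defaultdict

left = [{"num": i % 10} for i in range(1000)]
right = [{"int2": i % 10} for i in range(1000)]

# Nested-loop join (no usable join key): every pair of rows is compared.
nested_loop_comparisons = len(left) * len(right)

# Hash join on num = int2: one pass to build the hash table, one pass to probe.
buckets = defaultdict(list)
for row in right:
    buckets[row["int2"]].append(row)
hash_join_steps = len(right) + len(left)

print(nested_loop_comparisons)  # 1000000
print(hash_join_steps)          # 2000
```

The quadratic cost is why a missing or non-equality ON condition makes a job appear stuck: the work grows with the product of the two table sizes rather than their sum.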
Possible Cause 2: Data Skew
Check whether the problem is caused by data skew.
- Log in to the DLI console. In the navigation pane, choose Job Management > SQL Jobs.
- On the SQL Jobs page, locate the row that contains the target job, view the job details, and obtain the job ID.
Figure 9 Obtaining the job ID
- In the Operation column of the job, click Spark UI.
- On the displayed page, choose SQL from the menu bar. Click the hyperlink in the Description column of the row that contains the job ID.
- View the running status of the current stage in the Active Stage table on the displayed page. Click the hyperlink in the Description column.
- View the Launch Time and Duration of each task.
- Click Duration to sort tasks. Check whether the overall job duration is prolonged because a task has taken a long time.
As shown in Figure 10, when data skew occurs, the shuffle read data volume of one task is much greater than that of the other tasks.
Cause locating and solution
- Run group by and count on the join key to collect statistics on the number of records for each key value. The following is an example:
Join table lefttbl and table righttbl, where num in the lefttbl table is the join key. You can run group by and count on lefttbl.num.
SELECT * FROM lefttbl a LEFT JOIN righttbl b ON a.num = b.int2;
SELECT count(1) AS count, num FROM lefttbl GROUP BY lefttbl.num ORDER BY count DESC;
Figure 11 shows the result: there are far more records whose num value is 1 than records with any other value.
- Use concat(cast(round(rand() * 999999999) as string), num) to generate a random number for each key value.
- If the skew is serious and random numbers cannot be generated, see How Do I Eliminate Data Skew by Configuring AE Parameters?
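The salting technique above can be sketched in plain Python (illustrative only; in DLI you would apply the concat(...) expression in SQL, and the bucket count of 10 is an arbitrary choice for the sketch):

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the illustration

# Skewed join keys: num = 1 dominates, as in the lefttbl example above.
keys = [1] * 9000 + list(range(2, 1002))  # 9000 hot rows plus 1000 others

# Salt each key with a random prefix, analogous to
#   concat(cast(round(rand() * 999999999) as string), num)
# so the single hot key becomes many distinct keys that separate tasks can process.
SALT_BUCKETS = 10
salted = [f"{random.randrange(SALT_BUCKETS)}_{k}" for k in keys]

before = Counter(keys)
after = Counter(salted)

print(before[1])  # 9000 rows share one key, so one task is overloaded
print(max(after[f"{i}_1"] for i in range(SALT_BUCKETS)))  # each salted hot key holds far fewer rows
```

Note that salting one side of a join requires replicating each matching row on the other side once per salt value so that join results are preserved, which is why the AE-parameter approach referenced above is preferable when manual salting is impractical.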