Updated on 2022-09-14 GMT+08:00

Large Number of Shuffle Results Are Lost During Spark Task Execution

Issue

Spark tasks fail, and the task log shows that a large number of shuffle files are lost.

Cause Analysis

While a Spark task is running, each executor stores the shuffle files it generates in its own temporary directory so that they can be read later.

When an executor exits abnormally, NodeManager deletes the temporary directory of the container in which the executor was running. When other executors later request the shuffle results of that executor, an error is reported indicating that the files cannot be found.

Therefore, check whether any executor exited abnormally. On the Executors tab of the Spark task page, look for executors in the dead state and view the log of each one to determine why it exited. Some executors may themselves have exited because shuffle files could not be found, so identify the earliest executor that exited abnormally; its log is the most likely to show the root cause.
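The search for the earliest abnormally exited executor can be sketched in Python as follows. This is a minimal illustration, not part of the product: it assumes the executor list has already been exported as dictionaries with the fields `id`, `isActive`, `removeTime`, and `removeReason` (modelled on the executor summaries of the Spark monitoring REST API), and that `removeTime` uses the timestamp layout shown in the comment. The sample records are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout, for example "2022-09-14T08:15:30.000GMT".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def earliest_dead_executor(executors):
    """Return the dead executor record with the earliest removeTime, or None."""
    dead = [
        e for e in executors
        if not e.get("isActive", True) and e.get("removeTime")
    ]
    if not dead:
        return None
    return min(
        dead,
        key=lambda e: datetime.strptime(e["removeTime"], TIME_FORMAT),
    )


# Hypothetical sample data, for illustration only.
sample_executors = [
    {"id": "1", "isActive": True},
    {"id": "2", "isActive": False,
     "removeTime": "2022-09-14T08:20:00.000GMT",
     "removeReason": "hypothetical reason B"},
    {"id": "3", "isActive": False,
     "removeTime": "2022-09-14T08:15:30.000GMT",
     "removeReason": "hypothetical reason A"},
]

first = earliest_dead_executor(sample_executors)
if first is not None:
    print(first["id"], first["removeTime"], first["removeReason"])
```

The executor returned here is the one whose log is worth reading first; executors that died later may only be reporting the missing shuffle files.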

Common abnormal exit causes:

  • The executor runs out of memory (OOM).
  • Multiple tasks fail while the executor is running.
  • The node where the executor is located is cleared.
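
For the first cause, one quick check is to scan the log of the dead executor for a JVM `java.lang.OutOfMemoryError`. The sketch below assumes the log has already been downloaded as text; the function name and the sample log lines are hypothetical. It only detects OOM errors raised inside the JVM, so an executor whose container was terminated from outside for exceeding its memory limit would not be caught by this check.

```python
def find_oom_lines(log_text):
    """Return (line_number, line) pairs that mention a JVM OutOfMemoryError."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if "java.lang.OutOfMemoryError" in line
    ]


# Hypothetical log excerpt, for illustration only.
sample_log = "\n".join([
    "INFO executor started (hypothetical line)",
    "java.lang.OutOfMemoryError: Java heap space",
    "INFO executor shutting down (hypothetical line)",
])

for number, line in find_oom_lines(sample_log):
    print(f"line {number}: {line}")
```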

Procedure

Adjust the task parameters or code according to the actual cause of the abnormal executor exit, and then run the Spark task again.
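
As an example of a parameter adjustment, if the executor exited because of OOM, the executor memory settings can be raised when the task is resubmitted. The sketch below only builds the `--conf` arguments for `spark-submit`; the helper name, the values, and the application file name `your_app.py` are illustrative assumptions and must be tuned to the actual workload and cluster. `spark.executor.memoryOverhead` is the property name in Spark 2.3 and later; earlier versions on YARN use `spark.yarn.executor.memoryOverhead`.

```python
def conf_args(overrides):
    """Turn a {property: value} mapping into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(overrides.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; choose sizes that fit the cluster and the job.
oom_overrides = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
}

# your_app.py is a placeholder for the real application.
command = ["spark-submit"] + conf_args(oom_overrides) + ["your_app.py"]
print(" ".join(command))
```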