Large Number of Shuffle Results Are Lost During Spark Task Execution
Issue
Spark tasks fail, and the task log shows that a large number of shuffle files have been lost.
Cause Analysis
While a Spark task is running, each executor writes its temporary shuffle files to its own temporary directory so that they can be read later.
When an executor exits abnormally, NodeManager deletes the temporary directory of the container in which the executor ran. When other executors then request the shuffle results of that executor, the request fails because the files cannot be found.
Therefore, check whether any executor exited abnormally. On the Executors tab of the Spark task page, look for executors in the dead state and view the logs of each one to determine why it exited. Some executors may have exited only because a shuffle file could not be found, so find the earliest executor that exited abnormally; that is where the failure started.
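If the executor list is long, the search for the earliest dead executor can be scripted. The following is a minimal sketch, not an official tool. It assumes the executor list has already been fetched as JSON from the Spark REST API (`/api/v1/applications/[app-id]/allexecutors`) and that each entry carries the `id`, `isActive`, and `removeTime` fields in the timestamp layout shown. The sample data is made up for illustration.

```python
from datetime import datetime

# Timestamp layout assumed for removeTime, e.g. "2024-01-01T10:05:00.000GMT".
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fGMT"


def earliest_dead_executor(executors):
    """Return the removed executor with the earliest removeTime, or None."""
    dead = [
        e for e in executors
        # Skip the driver entry; keep only executors that are no longer
        # active and that have a recorded removal time.
        if e.get("id") != "driver"
        and not e.get("isActive", True)
        and e.get("removeTime")
    ]
    if not dead:
        return None
    return min(dead, key=lambda e: datetime.strptime(e["removeTime"], TIME_FORMAT))


# Hypothetical sample shaped like the executor list (values are invented).
sample = [
    {"id": "driver", "isActive": True},
    {"id": "1", "isActive": False, "removeTime": "2024-01-01T10:07:30.000GMT"},
    {"id": "2", "isActive": False, "removeTime": "2024-01-01T10:05:00.000GMT"},
    {"id": "3", "isActive": True},
]

first = earliest_dead_executor(sample)
print(first["id"])  # prints 2
```

The log of the executor returned here is the one to read first; later dead executors may only be reporting the missing shuffle files.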
Common abnormal exit causes:
- The executor runs out of memory (OOM).
- Multiple tasks fail while running on the executor.
- The node where the executor is located is cleared.
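For the OOM case, a quick scan of the executor or driver log can confirm the cause. The sketch below is illustrative only. It assumes two marker strings: the standard Java `java.lang.OutOfMemoryError` exception name, and the message that Spark on YARN typically logs when a container exceeds its memory limit (the exact wording may differ between versions). The function name and the sample log lines are hypothetical.

```python
# Marker strings assumed to indicate a memory-related executor exit.
OOM_MARKERS = (
    "java.lang.OutOfMemoryError",
    "Container killed by YARN for exceeding memory limits",
)


def first_oom_line(log_lines):
    """Return (line_number, text) of the first line containing an OOM marker, or None."""
    for number, line in enumerate(log_lines, start=1):
        if any(marker in line for marker in OOM_MARKERS):
            return number, line.rstrip("\n")
    return None


# Hypothetical log excerpt, invented for this example.
sample_log = [
    "INFO starting task\n",
    "ERROR java.lang.OutOfMemoryError: Java heap space\n",
    "INFO shutting down\n",
]

hit = first_oom_line(sample_log)
print(hit)
```

In practice, `log_lines` would be an open file object for the dead executor's log, since iterating a file yields its lines.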
Procedure
Adjust the task parameters or code based on the actual cause of the executor's abnormal exit, and then run the Spark task again.
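As an example for the OOM cause, the rerun might raise the executor memory settings. The sketch below only assembles a `spark-submit` command line with `--conf` overrides. The values `8g` and `2g`, the application name `my_job.py`, and the helper name are placeholders, not recommendations. `spark.executor.memoryOverhead` is the property name in recent Spark releases; older releases on YARN use a different name, so check the version in use.

```python
def build_submit_command(app, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    command = ["spark-submit"]
    # Sort the keys so the generated command is stable across runs.
    for key, value in sorted(conf.items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app)
    return command


# Illustrative overrides for an executor that exited because of OOM.
oom_overrides = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
}

command = build_submit_command("my_job.py", oom_overrides)
print(" ".join(command))
```

If the executor exited for another reason, such as repeated task failures caused by a bug in the job, changing these parameters will not help; fix the code instead.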