
What Can I Do If Shuffle Fetch Fails Due to the "Timeout Waiting for Task" Exception?

Question

When I run a 100 TB TPC-DS test suite in JDBCServer mode, the "Timeout waiting for task" exception is reported. As a result, shuffle fetch fails, the stage retries repeatedly, and the job cannot complete. What can I do?

Answer

In JDBCServer mode, the external shuffle service (ShuffleService) is enabled. During the reduce phase, all executors fetch shuffle data through the NodeManager, because the ShuffleService runs inside the NodeManager process. When the data volume is very large (more than 10 TB), the NodeManager can become a bottleneck, so some shuffle fetch tasks time out and the problem occurs.

You are advised to disable the ShuffleService for Spark jobs that process more than 10 TB of data. To do so, set spark.shuffle.service.enabled to false in the spark-defaults.conf configuration file.
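
For example, the relevant entry in spark-defaults.conf would look as follows (a minimal sketch showing only this parameter; other settings in the file stay unchanged):

    # Disable the external shuffle service so that executors serve their own shuffle data
    spark.shuffle.service.enabled  false

With the external shuffle service disabled, executors serve their shuffle output directly instead of going through the NodeManager, which avoids the NodeManager bottleneck described above. Note that changes to spark-defaults.conf typically take effect only after the JDBCServer service is restarted.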