Updated on 2025-08-28 GMT+08:00

What Should I Do If RuntimeError: Socket Timeout Is Displayed During Distributed Process Group Initialization using torchrun?

If the RuntimeError: Socket Timeout error occurs during distributed process group initialization with torchrun, add the following environment variables and create the training job again. The resulting logs show initialization details and help you locate the fault.

  • LOGLEVEL=INFO
  • TORCH_CPP_LOG_LEVEL=INFO
  • TORCH_DISTRIBUTED_DEBUG=DETAIL
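For example, the variables can be exported in the job's startup script before torchrun is launched (the torchrun command line shown in the comment is illustrative only; use your job's actual arguments):

```shell
# Enable verbose initialization logs for the training job.
# These must be set before torchrun starts.
export LOGLEVEL=INFO
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Then launch as usual, e.g. (hypothetical command line):
#   torchrun --nnodes=2 --nproc_per_node=8 train.py
```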

The RuntimeError: Socket Timeout error is caused by a large time discrepancy between tasks when the torchrun command is run. The discrepancy comes from initialization work performed before torchrun, such as downloading the training data and reading or writing checkpoints. If the time these steps take varies significantly from one task to another, some tasks reach torchrun much later than others and a Socket Timeout error may occur. When this error happens, compare the times at which each task starts running torchrun. If the gap is too large, optimize the initialization work that runs before torchrun so that all tasks reach it within a reasonable time window.
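One way to find which initialization step causes the skew is to log the wall-clock duration of each pre-torchrun task on every node and compare the logs. A minimal stdlib sketch (the helper name and the wrapped tasks are illustrative, not part of ModelArts or PyTorch):

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def timed(label, task, *args, **kwargs):
    """Run one initialization task and log its wall-clock duration,
    so logs from different nodes can be compared to find the slow step."""
    start = time.monotonic()
    result = task(*args, **kwargs)
    logging.info("%s finished in %.1f s", label, time.monotonic() - start)
    return result


# Illustrative usage: wrap each init step that runs before torchrun.
data = timed("download training data", lambda: list(range(3)))
ckpt = timed("load checkpoint", lambda: {"step": 0})
```

If one label consistently takes much longer on some nodes, that step is the one to optimize before launching torchrun.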