What Should I Do If RuntimeError: Socket Timeout Is Displayed During Distributed Process Group Initialization using torchrun?
If the RuntimeError: Socket Timeout error occurs during the distributed process group initialization using torchrun, you can add the following environment variables to create a training job again to view initialization details and further locate the fault.
- LOGLEVEL=INFO
- TORCH_CPP_LOG_LEVEL=INFO
- TORCH_DISTRIBUTED_DEBUG=DETAIL
The RuntimeError: Socket Timeout error is caused by a significant time discrepancy between tasks when running the torchrun command. The time discrepancy is caused by initialization tasks, like downloading the training data and checkpoint read/write, which happen before the torchrun command is run. If the time taken to complete these initialization tasks varies significantly, a Socket Timeout error may occur. When this error happens, check the time difference between the torchrun execution points for each task. If the time difference is too large, optimize the initialization process before running the torchrun command to ensure a reasonable time gap.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot