Error Message "RuntimeError: connect() timed out" Displayed in Logs
Symptom
When PyTorch is used for distributed training, the following error is reported in the logs:
RuntimeError: connect() timed out
Possible Causes
If data is copied before this error occurs, the copy might not finish on all nodes at the same time. If torch.distributed.init_process_group() is called while the data copy is still in progress on some nodes, the connection times out.
Solution
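Call torch.distributed.init_process_group() first, copy the data only in the process whose local_rank is 0, and then call torch.distributed.barrier() so that all processes wait until the copy is complete.

import moxing as mox
import torch

# Initialize the process group before copying any data.
torch.distributed.init_process_group()
# local_rank, src, and dst are defined by your training script.
if local_rank == 0:
    mox.file.copy_parallel(src, dst)
# Block until every process reaches this point.
torch.distributed.barrier()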
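The solution leaves local_rank, src, and dst undefined. The following is a minimal sketch of one way to fill them in, not the documented implementation. It assumes that the launcher exports the LOCAL_RANK environment variable (torchrun does). The helper names get_local_rank, copy_once_per_node, and prepare_data are hypothetical and are used for illustration only.

```python
import os
from typing import Callable, Mapping, Optional


def get_local_rank(env: Optional[Mapping[str, str]] = None) -> int:
    """Return this process's rank within its node.

    Assumption: the launcher exports LOCAL_RANK (torchrun does).
    Falls back to 0 when the variable is not set.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def copy_once_per_node(local_rank: int,
                       copy_fn: Callable[[], None],
                       barrier_fn: Callable[[], None]) -> bool:
    """Run copy_fn only on local rank 0, then call barrier_fn in every process.

    Returns True if this process performed the copy.
    """
    copied = False
    if local_rank == 0:
        copy_fn()
        copied = True
    # Every process, including local rank 0, waits here.
    barrier_fn()
    return copied


def prepare_data(src: str, dst: str) -> None:
    """Hypothetical wiring of the helpers to PyTorch and MoXing.

    The libraries are imported lazily so that the helpers above can be
    exercised on a machine where they are not installed.
    """
    import moxing as mox
    import torch.distributed as dist

    # Same call as in the solution: join the process group first.
    dist.init_process_group()
    copy_once_per_node(
        get_local_rank(),
        copy_fn=lambda: mox.file.copy_parallel(src, dst),
        barrier_fn=dist.barrier,
    )
```

In a training script, prepare_data(src, dst) would be called once at startup, before any data loading. Passing the copy and barrier functions as arguments keeps the ordering logic separate from the libraries, so it can be checked locally with stand-in functions.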
Summary and Suggestions
- Use the online notebook environment for debugging. For details, see Using JupyterLab to Develop a Model.
- Use the local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using the Local IDE to Develop a Model.