Error Message "RuntimeError: connect() timed out" Is Displayed in Logs
Symptom
When PyTorch is used for distributed training, error message "RuntimeError: connect() timed out" is displayed in logs.
Possible Causes
The possible cause is as follows:
Data replication did not complete on all nodes at the same time. If torch.distributed.init_process_group() was executed while data replication was still in progress on some nodes, the call waited for those nodes and timed out.
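The failure mode can be illustrated with a minimal standard-library sketch (not ModelArts or PyTorch code; all names and timing values here are hypothetical): several workers must rendezvous within a deadline, and one worker that is still "copying data" arrives late, so the punctual workers time out, analogous to "connect() timed out".

```python
import threading
import time

world_size = 4
# A barrier with a short wait deadline stands in for the rendezvous
# performed by init_process_group(); 0.5 s is an arbitrary illustration value.
barrier = threading.Barrier(world_size)
errors = []  # ranks whose rendezvous failed

def worker(rank):
    # Rank 3 is still replicating data and arrives late at the rendezvous.
    if rank == 3:
        time.sleep(2.0)
    try:
        barrier.wait(timeout=0.5)  # rendezvous deadline
    except threading.BrokenBarrierError:
        errors.append(rank)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# The punctual ranks hit the deadline, which breaks the barrier for
# every rank, so all of them record an error.
```

One late worker is enough to fail the rendezvous for everyone, which is why the error appears on nodes whose own data copy had already finished.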
Solution
Initialize the process group before copying the data, let only the process whose local_rank is 0 perform the copy, and use a barrier so that the other processes wait until the copy is complete:

```python
import moxing as mox
import torch

# Initialize the process group first, so that no node is still busy
# copying data when init_process_group() is called.
torch.distributed.init_process_group()

# Only the process with local_rank 0 copies the data. local_rank, src,
# and dst are defined elsewhere in the training script.
if local_rank == 0:
    mox.file.copy_parallel(src, dst)

# All processes wait here until the copy is complete.
torch.distributed.barrier()
```
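Why this ordering fixes the timeout can be shown with a minimal standard-library sketch (hypothetical stand-ins, not the MoXing or PyTorch APIs): every rank joins the group early, only rank 0 does the slow copy, and a barrier holds the other ranks until the copy is done.

```python
import threading

copied = []  # stands in for the copied dataset at dst
seen = []    # what each rank observes after the barrier

def worker(rank, barrier):
    # Every rank reaches this point early, mirroring
    # torch.distributed.init_process_group() before any slow copy.
    if rank == 0:
        # Only rank 0 performs the slow copy, mirroring
        # mox.file.copy_parallel(src, dst).
        copied.append("data")
    # Mirrors torch.distributed.barrier(): all ranks wait here until
    # rank 0 has finished copying.
    barrier.wait()
    seen.append((rank, list(copied)))

world_size = 4
barrier = threading.Barrier(world_size)
threads = [threading.Thread(target=worker, args=(r, barrier)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# After the barrier, every rank sees the completed copy.
```

Because the rendezvous happens before the copy rather than after it, no rank is left waiting on a node that is still replicating data.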
Summary and Suggestions
- Use the notebook environment for online debugging. For details, see Using JupyterLab to Develop Models.
- Use a local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using a Local IDE to Develop Models.