Updated on 2025-08-22 GMT+08:00

Error Message "RuntimeError: connect() timed out" Is Displayed in Logs

Symptom

When PyTorch is used for distributed training, error message "RuntimeError: connect() timed out" is displayed in logs.

Possible Causes

The possible causes are as follows:

If data is copied before training starts, replication may not complete on all nodes at the same time. If torch.distributed.init_process_group() is executed while data replication is still in progress on some nodes, the call times out.

Solution

If the issue is caused by asynchronous data replication across nodes with no synchronization barrier, call torch.distributed.init_process_group() before copying data, copy the data only on the process where local_rank == 0, and then call torch.distributed.barrier() so that all processes wait until data replication is complete. For details, see the following code:
import os

import moxing as mox
import torch

# Initialize the process group before copying data.
torch.distributed.init_process_group()

# local_rank identifies the process within the current node; it is
# typically read from the environment variable set by the launcher.
local_rank = int(os.environ["LOCAL_RANK"])
if local_rank == 0:
    # src and dst are the source and destination paths of your dataset.
    mox.file.copy_parallel(src, dst)

# Block every process until data replication is complete on all nodes.
torch.distributed.barrier()
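The copy-then-barrier pattern above does not depend on PyTorch itself. As an illustration only, the following sketch reproduces it with Python's standard multiprocessing module, where mp.Barrier stands in for torch.distributed.barrier() and a plain file copy stands in for mox.file.copy_parallel(); all names here (worker, run_demo) are illustrative, not part of ModelArts or MoXing.

```python
import multiprocessing as mp
import os
import shutil
import tempfile


def worker(rank, barrier, src, dst):
    """Each process plays the role of one training rank."""
    if rank == 0:
        # Only rank 0 copies the data, mirroring
        # `if local_rank == 0: mox.file.copy_parallel(src, dst)`.
        shutil.copy(src, dst)
    # Every rank blocks here until all ranks have arrived, i.e. until
    # the copy on rank 0 has finished -- the barrier() step.
    barrier.wait()
    # Past the barrier, the copied data is guaranteed to exist.
    assert os.path.exists(dst)


def run_demo(world_size=4):
    # The "fork" context keeps the demo self-contained on Unix-like systems.
    ctx = mp.get_context("fork")
    tmp = tempfile.mkdtemp()
    src = os.path.join(tmp, "src.txt")
    dst = os.path.join(tmp, "dst.txt")
    with open(src, "w") as f:
        f.write("dataset")
    barrier = ctx.Barrier(world_size)
    procs = [ctx.Process(target=worker, args=(r, barrier, src, dst))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # A nonzero exit code would mean a worker's assertion failed.
    return all(p.exitcode == 0 for p in procs)
```

Without the barrier, a rank other than 0 could try to read the data before the copy finishes, which is exactly the race that causes the timeout in distributed training.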

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug your training code and minimize migration errors.