Updated on 2025-06-06 GMT+08:00

Error Message "no socket interface found" Is Displayed in Logs

Symptom

In a distributed training job that runs on a PyTorch image, the NCCL debug log level is set to INFO:
import os
os.environ["NCCL_DEBUG"] = "INFO"

The following error message is displayed.

job0879f61e-job-base-pda-2-0:712:712 [0] bootstrap.cc:37 NCCL WARN Bootstrap : no socket interface found
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO init.cc:128 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:76 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:245 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:266 -> 3
Traceback (most recent call last):
  File "train_net.py", line 1923, in <module>
    main_worker(args)
  File "train_net.py", line 355, in main_worker
    network = torch.nn.parallel.DistributedDataParallel(network, device_ids=device_ids, find_unused_parameters=True)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:374, internal error

Possible Causes

Possible causes are as follows:

  • Cause 1: The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. As a result, communication is slow and unstable, and InfiniBand communication is interrupted.
  • Cause 2: NCCL_SOCKET_IFNAME is not set or is set incorrectly. If the NCCL version is earlier than 2.14, you must set the NCCL_SOCKET_IFNAME environment variable manually.
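
Whether cause 2 applies depends on the NCCL version in the image. The following is a minimal sketch of that comparison, not part of ModelArts or PyTorch: the helper name nccl_older_than is hypothetical, and it assumes the version is available as a (major, minor, patch) tuple.

```python
def nccl_older_than(version, threshold=(2, 14)):
    """Return True if an NCCL version tuple is earlier than the threshold.

    `version` is assumed to be a tuple such as (2, 10, 3); only the major
    and minor components are compared against `threshold`.
    """
    return tuple(version[:2]) < tuple(threshold)


# Example: NCCL 2.10.3 predates 2.14, so NCCL_SOCKET_IFNAME would need to be set manually.
needs_manual_ifname = nccl_older_than((2, 10, 3))
```

In a training script, the tuple could come from torch.cuda.nccl.version(), which returns (major, minor, patch) in recent PyTorch releases. Older releases return a single integer instead, which this sketch does not handle.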

Solution

  • For cause 1, add the following environment variables to the code:
    import os
    os.environ["NCCL_IB_TC"] = "128"
    os.environ["NCCL_IB_GID_INDEX"] = "3"
    os.environ["NCCL_IB_TIMEOUT"] = "22"
  • For cause 2, set the NCCL_SOCKET_IFNAME environment variable in the code.
    import os
    os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

    The NCCL_SOCKET_IFNAME setting is required only when the NCCL version is earlier than 2.14.
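
The settings for both causes can be applied in one place near the top of the training script. The following is a minimal sketch: configure_nccl_env and available_interfaces are hypothetical helper names, the values are the ones listed above, and it assumes the variables are set before the process group is initialized, because NCCL reads them at initialization. Unlike the snippets above, it uses setdefault, so values already injected by the platform are not overwritten.

```python
import os
import socket


def available_interfaces():
    """Return the network interface names visible in the container (Unix only)."""
    return [name for _, name in socket.if_nameindex()]


def configure_nccl_env(ifname="eth0", set_ifname=True, env=os.environ):
    """Apply the NCCL settings from this article without overriding existing values.

    Set `set_ifname` to False when the NCCL version is 2.14 or later.
    Returns the effective values of the variables this function manages.
    """
    defaults = {
        "NCCL_IB_TC": "128",
        "NCCL_IB_GID_INDEX": "3",
        "NCCL_IB_TIMEOUT": "22",
    }
    if set_ifname:
        defaults["NCCL_SOCKET_IFNAME"] = ifname
    for key, value in defaults.items():
        env.setdefault(key, value)  # keep any value that is already set
    return {key: env[key] for key in defaults}
```

Printing available_interfaces() in the job log is one way to confirm that the name passed as ifname (eth0 in this article) actually exists in the container before relying on it.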