Error Message "no socket interface found" Is Displayed in Logs
Updated on 2025-06-06 GMT+08:00
Symptom
An NCCL debug log level is set in a distributed job executed using a PyTorch image.
import os
os.environ["NCCL_DEBUG"] = "INFO"
The following error message is displayed.
job0879f61e-job-base-pda-2-0:712:712 [0] bootstrap.cc:37 NCCL WARN Bootstrap : no socket interface found
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO init.cc:128 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:76 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:245 -> 3
job0879f61e-job-base-pda-2-0:712:712 [0] NCCL INFO bootstrap.cc:266 -> 3
Traceback (most recent call last):
  File "train_net.py", line 1923, in <module>
    main_worker(args)
  File "train_net.py", line 355, in main_worker
    network = torch.nn.parallel.DistributedDataParallel(network, device_ids=device_ids, find_unused_parameters=True)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
    self.broadcast_bucket_size)
  File "/home/work/anaconda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:374, internal error
Possible Causes
Possible causes are as follows:
- Cause 1: The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. As a result, communication is slow and unstable, and InfiniBand communication is interrupted.
- Cause 2: NCCL_SOCKET_IFNAME is set incorrectly. If the NCCL version is earlier than 2.14, this environment variable must be set manually.
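For cause 2, it can help to check which network interfaces actually exist in the training container before choosing a value. The following is a minimal sketch, not part of the original procedure: the helper name candidate_interfaces is hypothetical, and it only filters out the loopback interface, so the right choice among the remaining names still depends on your cluster.

```python
import socket


def candidate_interfaces(names=None):
    """Return interface names that could be used for NCCL_SOCKET_IFNAME.

    Hypothetical helper: it only drops the loopback interface "lo".
    If names is None, the interfaces of the current host are listed.
    """
    if names is None:
        # socket.if_nameindex() returns (index, name) pairs for the host.
        names = [name for _, name in socket.if_nameindex()]
    return [name for name in names if name != "lo"]


if __name__ == "__main__":
    print(candidate_interfaces())
```

If the name assigned to NCCL_SOCKET_IFNAME does not appear in this list, NCCL is unlikely to find a usable socket interface.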
Solution
- For cause 1, add the following environment variables to the code:
import os
os.environ["NCCL_IB_TC"] = "128"
os.environ["NCCL_IB_GID_INDEX"] = "3"
os.environ["NCCL_IB_TIMEOUT"] = "22"
- For cause 2, set the NCCL_SOCKET_IFNAME environment variable in the code:
import os
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
Setting NCCL_SOCKET_IFNAME manually is required only when the NCCL version is earlier than 2.14.
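The two fixes can be combined into one helper that runs before the process group is initialized. This is a sketch under stated assumptions: the function name configure_nccl_env and its parameters are hypothetical, the values are the ones listed in this topic, and the NCCL version is passed in by the caller as a (major, minor, ...) tuple.

```python
import os

# Values taken from the solution for cause 1; adjust them to your cluster.
IB_SETTINGS = {
    "NCCL_IB_TC": "128",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_TIMEOUT": "22",
}


def configure_nccl_env(nccl_version, ifname="eth0", env=None):
    """Apply the NCCL settings from this topic (hypothetical helper).

    nccl_version: tuple such as (2, 10, 3), supplied by the caller.
    ifname: interface name for NCCL_SOCKET_IFNAME ("eth0" as in this topic).
    env: mapping to modify; defaults to os.environ.
    """
    env = os.environ if env is None else env
    env.update(IB_SETTINGS)
    # NCCL_SOCKET_IFNAME is set only for NCCL versions earlier than 2.14.
    if tuple(nccl_version[:2]) < (2, 14):
        env["NCCL_SOCKET_IFNAME"] = ifname
    return env
```

Call the helper at the top of the training script, before creating DistributedDataParallel or initializing the process group. In recent PyTorch releases, torch.cuda.nccl.version() returns the NCCL version as a tuple; older releases may return a single integer instead, in which case it needs to be converted first.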