Updated on 2024-04-30 GMT+08:00

Error Message "no socket interface found" Displayed in Logs

Symptom

In a distributed training job that runs on a PyTorch image, the NCCL debug log level is set in the code as follows:
import os
os.environ["NCCL_DEBUG"] = "INFO"

The logs then contain the error message "no socket interface found".

Figure 1 Error log

Possible Causes

The environment variables NCCL_IB_TC, NCCL_IB_GID_INDEX, and NCCL_IB_TIMEOUT are not configured. Without them, communication is slow and unstable, and InfiniBand (IB) communication is interrupted.
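
To confirm this cause before changing anything, it may help to check which of the three variables are actually unset in the job's environment. The following is a minimal sketch; the helper name missing_nccl_ib_vars is hypothetical and is not part of ModelArts, PyTorch, or NCCL.

```python
import os

# The three variables named in the possible cause.
NCCL_IB_VARS = ("NCCL_IB_TC", "NCCL_IB_GID_INDEX", "NCCL_IB_TIMEOUT")


def missing_nccl_ib_vars(env=None):
    """Return the names of the NCCL IB variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in NCCL_IB_VARS if not env.get(name)]


if __name__ == "__main__":
    # Print the result near the start of the training script so that it
    # appears in the job logs.
    print("Unset NCCL IB variables:", missing_nccl_ib_vars())
```

If the printed list is empty, all three variables are already set and the cause is likely elsewhere.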

Solution

Set the following environment variables in the training code, before NCCL is initialized:

import os
os.environ["NCCL_IB_TC"] = "128"
os.environ["NCCL_IB_GID_INDEX"] = "3"
os.environ["NCCL_IB_TIMEOUT"] = "22"
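
The same settings can be grouped in one place and applied before the process group is created. This is only a sketch under stated assumptions: the helper name apply_nccl_ib_settings is hypothetical, the values are the ones given above, and the commented call to torch.distributed.init_process_group assumes that the job initializes NCCL through PyTorch in this way.

```python
import os

# Values taken from the solution above.
NCCL_IB_SETTINGS = {
    "NCCL_IB_TC": "128",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_TIMEOUT": "22",
}


def apply_nccl_ib_settings(env=None):
    """Set each NCCL IB variable and return the mapping that was applied.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    for name, value in NCCL_IB_SETTINGS.items():
        env[name] = value
    return dict(NCCL_IB_SETTINGS)


# Apply the settings first, so that they are in place when NCCL starts.
apply_nccl_ib_settings()

# Assumption: the process group is created only after this point, e.g.
# torch.distributed.init_process_group(backend="nccl")
```

Keeping the values in a single dictionary makes it easier to log them or adjust them per cluster.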