Updated on 2024-04-11 GMT+08:00

An NCCL Error Occurs When a Training Job Fails to Be Executed

Symptom

The training job fails, and its logs contain NCCL-related errors such as "NCCL timeout", "RuntimeError: NCCL communicator was aborted on rank 7", "NCCL WARN Bootstrap : no socket interface found", and "NCCL INFO Call to connect returned Connection refused, retrying".

Possible Causes

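NCCL (NVIDIA Collective Communications Library) provides primitives for communication between GPUs, covering both collective communication and point-to-point send/receive. The errors listed above are related to how NCCL is configured for the network: either an InfiniBand Verbs operation timed out, or NCCL could not find a usable network adapter. In both cases, adjusting the NCCL environment variables can solve the problem.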

Solution

  1. Go to the details page of the training job, click the Logs tab, and view the NCCL error.
    • If the error message "NCCL timeout" or "RuntimeError: NCCL communicator was aborted on rank 7" is displayed, an InfiniBand Verbs operation has timed out. Click Rebuild in the upper right corner to create the training job again, set the environment variable NCCL_IB_TIMEOUT to 22, and submit the training job.
    • If the error message "NCCL WARN Bootstrap : no socket interface found" or "NCCL INFO Call to connect returned Connection refused, retrying" is displayed, NCCL cannot find the communication network adapter or cannot reach the IP address. Check whether the training code sets the NCCL_SOCKET_IFNAME environment variable. The system injects this variable automatically, so the training code does not need to set it. If the training code sets it, remove that setting, click Rebuild in the upper right corner to create the training job again, and submit the training job.
  2. Wait and check whether the status of the training job changes to Completed.
    • If yes, no further action is required.
    • If no, contact technical support to check the node status.
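
The check for a manually set NCCL_SOCKET_IFNAME in step 1 can be partly automated. The following Python sketch scans the text of a training script and reports the lines that appear to set the variable. The function name find_ifname_overrides and the search patterns are illustrative assumptions, not part of ModelArts or NCCL, and the patterns do not cover every way a variable can be set.

```python
import re
from typing import List, Tuple

# Matches either a single or a double quote character.
_Q = "['\"]"

# Illustrative (not exhaustive) patterns for code that sets NCCL_SOCKET_IFNAME.
_PATTERNS = [
    # Python: os.environ["NCCL_SOCKET_IFNAME"] = ...  (but not a == comparison)
    re.compile(r"os\.environ\[\s*" + _Q + "NCCL_SOCKET_IFNAME" + _Q + r"\s*\]\s*=(?!=)"),
    # Python: os.environ.setdefault("NCCL_SOCKET_IFNAME", ...)
    re.compile(r"os\.environ\.setdefault\(\s*" + _Q + "NCCL_SOCKET_IFNAME" + _Q),
    # Python: os.putenv("NCCL_SOCKET_IFNAME", ...)
    re.compile(r"os\.putenv\(\s*" + _Q + "NCCL_SOCKET_IFNAME" + _Q),
    # Shell: NCCL_SOCKET_IFNAME=eth0 or export NCCL_SOCKET_IFNAME=eth0
    re.compile(r"\bNCCL_SOCKET_IFNAME\s*=(?!=)"),
]


def find_ifname_overrides(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each line that seems to set the variable."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in _PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Calling find_ifname_overrides on the contents of each training script or launch script gives a list of candidate lines to review and remove before rebuilding the job. An empty list does not guarantee that the variable is never set, for example when it is set by a configuration file or a third-party launcher.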

Summary and Suggestions

  • The NCCL_SOCKET_IFNAME environment variable specifies the name of the network adapter used for communication. For example, NCCL_SOCKET_IFNAME=eth0 means that only the eth0 network adapter is used. The system injects this environment variable automatically. Because the name of the communication network adapter is not fixed, do not hard-code this variable in the training code.
  • The NCCL_IB_TIMEOUT environment variable controls the InfiniBand Verbs timeout. NCCL uses a default value of 18, and valid values range from 1 to 22. A larger value allows a longer wait before a timeout is reported.
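
To see what changing NCCL_IB_TIMEOUT from 18 to 22 means in practice, note that the NCCL documentation describes the value as an exponent: the timeout is computed as 4.096 µs × 2^value. The Python sketch below applies that formula. The function name ib_timeout_seconds is a hypothetical helper introduced for illustration, and the result is the timeout for a single attempt, not the total time before a job fails.

```python
def ib_timeout_seconds(value: int) -> float:
    """Convert an NCCL_IB_TIMEOUT setting to seconds using 4.096 us * 2**value."""
    if not 1 <= value <= 22:
        # The valid range stated for NCCL_IB_TIMEOUT is 1 to 22.
        raise ValueError("NCCL_IB_TIMEOUT must be between 1 and 22")
    return 4.096e-6 * (2 ** value)


if __name__ == "__main__":
    for setting in (18, 22):
        print(f"NCCL_IB_TIMEOUT={setting}: about {ib_timeout_seconds(setting):.2f} s")
```

Under this formula, the default of 18 corresponds to roughly 1.07 seconds and the maximum of 22 to roughly 17.18 seconds, so each step up doubles the wait.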