Enabling HCCL Communication Operator-Level Re-execution for Supernodes
Scenario
Optical modules in Snt9B23 supernodes have a relatively high failure rate. To improve system stability and reliability, a re-execution mechanism is introduced at the Huawei Collective Communication Library (HCCL) communication operator level.
HCCL is a distributed communication library designed by Huawei for Ascend AI processors. It optimizes collaboration between multiple devices and accelerates distributed training of deep learning models in AI scenarios that require large-scale compute. In distributed training, HCCL coordinates data synchronization (such as gradient aggregation and parameter updates) between multiple Ascend processors, reducing communication overhead and improving training efficiency.
Constraints
- Only Snt9B23 supernodes are supported.
- Enabling operator re-execution slightly affects the performance.
- Re-execution depends on the VPC plane (non-parameter plane) network for status negotiation within the communication domain. If the VPC planes are different, re-execution cannot be performed.
- On the HCCS plane, re-execution cannot be performed if the link has not recovered and the route has not converged.
- Re-execution requires that all cards in a communication domain stop at the same communication operator when a fault occurs. Otherwise, re-execution cannot be performed. The success rate is about 95%.
- Using a communication operator in in-place mode may pollute the UserIn data, which affects the reliability of re-execution. 80% of communication operators can still be re-executed in in-place mode, but there are exceptions, such as the all_reduce, all_gather, and reduce_scatter operators in the Torch framework.
- For RoH/RoCE failover (lane borrowing) caused by an intermittent disconnection or a link disconnection, re-execution can be performed only once in the same communication domain, and switchback is not supported. Services can continue during the failover, but you should save checkpoints and rectify the faults promptly.
- The following table lists the supported HCCL re-execution scope for the current Ascend execution mode.
Table 1 HCCL re-execution scope
Mode | HCCL Communication Operator Unfolding Mode | Supported |
---|---|---|
Single-operator | Stars | Supported |
Single-operator | Ffts+ | Supported |
Single-operator | AI CPU unfolding | Supported |
Single-operator | Integrated communication and computing (mc2) | Not supported |
Graph mode | Full POD mode, in which communication operators are integrated as expanded tasks | Not supported. In full POD mode, HCCL is not involved in the graph execution process, so the operators cannot be re-executed. |
Graph mode | AI CPU unfolding | Supported |
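A launch script can encode Table 1 as a lookup so that an unsupported combination is caught before re-execution is enabled. The following is a minimal sketch: the function name `hccl_retry_supported` and the labels `single_op`, `graph`, `stars`, `ffts_plus`, `aicpu`, `mc2`, and `full_pod` are hypothetical names chosen for this illustration, not HCCL identifiers.

```shell
# Hypothetical lookup that mirrors Table 1 (labels are illustrative only).
# Usage: hccl_retry_supported MODE UNFOLDING
hccl_retry_supported() {
    case "$1:$2" in
        single_op:stars|single_op:ffts_plus|single_op:aicpu)
            echo "supported" ;;
        single_op:mc2)
            echo "not supported" ;;
        graph:aicpu)
            echo "supported" ;;
        graph:full_pod)
            echo "not supported" ;;
        *)
            echo "unknown combination: $1 / $2" >&2
            return 1 ;;
    esac
}

hccl_retry_supported single_op ffts_plus   # prints "supported"
hccl_retry_supported graph full_pod        # prints "not supported"
```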
Principles
The interconnect system of the Snt9B23 supernode has two main transmission planes: the HCCS plane and the RoH/RoCE plane.
On the HCCS plane, the optical interconnection technology is used between L1-1520 and L2-1520. On the RoH/RoCE plane, optical interconnection is used for parts beyond the NPU range. The fault rate of the electrical interconnection domain is relatively low. Therefore, this mechanism is mainly used to handle optical module faults in the optical interconnection domain. Specifically:
- Faulty optical module between L1-1520 and L2-1520 on the HCCS plane
- Faulty optical module where the Snt9B23 connects out to the RoH/RoCE plane
HCCS plane
On the HCCS plane, if the optical module between L1 and L2 experiences an intermittent disconnection or a link disconnection, the 1520 device automatically switches paths (provided that multiple paths exist). However, the disconnection may cause packet loss and, in turn, service interruption, in which case the framework layer rolls back to the previous checkpoint for resumable training. With the HCCL re-execution mechanism, the failed communication operator can be re-executed after the 1520 device completes path switching, which effectively reduces rollbacks to the checkpoint and improves service continuity and reliability.
RoH/RoCE plane
On the RoH/RoCE plane, the protocol has a built-in transport-layer retransmission mechanism that can recover from packet loss or intermittent disconnections. However, the reliability of this mechanism is limited. To enhance overall reliability, a re-execution mechanism is introduced at the HCCL layer: when an intermittent disconnection lasts for more than 30 seconds or a link disconnection occurs, the system establishes a new transmission path (lane borrowing) and starts operator-level re-execution, ensuring service stability.
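The rule described for the RoH/RoCE plane can be summarized as a small decision function. This is an illustration of the stated thresholds only, not HCCL's internal logic; the function name `roce_recovery_action`, the fault labels `intermittent` and `link_down`, and the returned strings are all hypothetical.

```shell
# Illustration only: restates the rule in the text, not HCCL internals.
# Usage: roce_recovery_action FAULT_TYPE [DURATION_SECONDS]
#   FAULT_TYPE is "intermittent" or "link_down" (hypothetical labels).
roce_recovery_action() {
    local fault_type="$1" duration_s="${2:-0}"
    if [ "$fault_type" = "link_down" ]; then
        # A link disconnection triggers lane borrowing plus re-execution.
        echo "lane_borrowing_and_reexecution"
    elif [ "$fault_type" = "intermittent" ] && [ "$duration_s" -gt 30 ]; then
        # An intermittent disconnection lasting more than 30 seconds does the same.
        echo "lane_borrowing_and_reexecution"
    else
        # Shorter glitches are left to transport-layer retransmission.
        echo "transport_retransmission"
    fi
}

roce_recovery_action intermittent 10   # prints "transport_retransmission"
roce_recovery_action intermittent 45   # prints "lane_borrowing_and_reexecution"
```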
Parameter Configuration (HCCL_OP_RETRY_ENABLE)
The environment variable HCCL_OP_RETRY_ENABLE configures whether HCCL operator re-execution is enabled. Re-execution is the process in which HCCL attempts to run a communication operator again after the operator reports an SDMA or RDMA CQE error. This feature can effectively avoid communication interruptions caused by intermittent hardware disconnections and improves communication stability.
The re-execution feature can be configured in the communication domains at the following physical layers:
- L0: communication domain within a server
- L1: communication domain between servers
- L2: communication domain between supernodes
Configuration:
Before running a training job, run the following command on the server node:
export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"
Parameter | Description | Value Range | Default Value | Recommended Value |
---|---|---|---|---|
L0 | Communication domain within a server | | 0 | 0 |
L1 | Communication domain between servers | | 0 | 1 |
L2 | Communication domain between supernodes | | 0 | 1 |
Note:
- When L2 is set to 1, the communication between supernodes can be performed using the standby device NIC when the device NIC is faulty. The standby NIC is the NIC of the other die in the same NPU.
- If the communication domain is created based on the ranktable, you need to configure the standby NIC using the backup device ip parameter in the ranktable file.
- If the communication domain is created based on the root broadcast, the two dies of the same NPU are automatically configured as the standby NICs of each other. No manual configuration is required.
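The L0/L1/L2 setting can be assembled and sanity-checked in a launch script before the training job starts. The sketch below assumes that each flag takes 0 (disabled) or 1 (enabled), which is inferred from the default and recommended values rather than from a documented value range. `build_retry_enable` is a hypothetical helper, not part of HCCL.

```shell
# Hypothetical helper: builds the HCCL_OP_RETRY_ENABLE value from three flags.
# Assumption: each flag is 0 (disabled) or 1 (enabled).
# Usage: build_retry_enable L0_FLAG L1_FLAG L2_FLAG
build_retry_enable() {
    local l0="$1" l1="$2" l2="$3" flag
    for flag in "$l0" "$l1" "$l2"; do
        case "$flag" in
            0|1) ;;
            *) echo "each flag must be 0 or 1, got: '$flag'" >&2
               return 1 ;;
        esac
    done
    printf 'L0:%s, L1:%s, L2:%s' "$l0" "$l1" "$l2"
}

# Recommended values from the table above: L0 off, L1 and L2 on.
if value="$(build_retry_enable 0 1 1)"; then
    export HCCL_OP_RETRY_ENABLE="$value"
    echo "HCCL_OP_RETRY_ENABLE=${HCCL_OP_RETRY_ENABLE}"
fi
```

Building the string from validated flags avoids typos such as a missing layer or a stray value, which would otherwise only surface once the job is running.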
Parameter Configuration (HCCL_OP_RETRY_PARAMS)
The environment variable HCCL_OP_RETRY_PARAMS configures the parameters for re-executing HCCL operators, including the maximum number of re-executions, the waiting time before the first re-execution, and the interval between two re-executions.
Configuration example
export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"
Parameter | Description | Type | Value Range | Default Value | Unit | Recommended Value |
---|---|---|---|---|---|---|
MaxCnt | Maximum number of re-executions | uint32 | [1, 10] | 3 | Count | Retain the default value 3. |
HoldTime | Waiting time from the time when a communication operator execution failure is detected to the time when the operator is re-executed for the first time | uint32 | [0, 60000] | 5000 | ms | Retain the default value 5000. |
IntervalTime | Interval between two re-executions | uint32 | [0, 60000] | 1000 | ms | Retain the default value 1000. |
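To get a feel for how long a job may pause before re-execution is exhausted, the three parameters can be combined into a rough wait budget. The sketch below assumes the waits simply add up (one HoldTime before the first re-execution, then one IntervalTime before each later one) and ignores the run time of the operator itself; HCCL's actual timing may differ. `retry_wait_budget_ms` is a hypothetical helper.

```shell
# Values from the configuration example above.
MAX_CNT=3          # MaxCnt: maximum number of re-executions
HOLD_TIME_MS=5000  # HoldTime: wait before the first re-execution (ms)
INTERVAL_MS=1000   # IntervalTime: wait between two re-executions (ms)

# Assumption: waits add up linearly and operator run time is ignored.
# Usage: retry_wait_budget_ms MAXCNT HOLDTIME_MS INTERVALTIME_MS
retry_wait_budget_ms() {
    local max_cnt="$1" hold_ms="$2" interval_ms="$3"
    echo $(( hold_ms + (max_cnt - 1) * interval_ms ))
}

retry_wait_budget_ms "$MAX_CNT" "$HOLD_TIME_MS" "$INTERVAL_MS"   # prints 7000
```

With the defaults this gives 5000 + 2 × 1000 = 7000 ms of waiting across three re-executions.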
Constraints:
This environment variable takes effect only when the HCCL re-execution feature is enabled (at any layer) using the HCCL_OP_RETRY_ENABLE environment variable.
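A pre-flight check in the launch script can catch out-of-range values and the case where no layer is enabled. The following is a minimal sketch using the value ranges from the table above; the helper names `in_range`, `check_retry_params`, and `any_layer_enabled` are hypothetical, and the sketch assumes that a layer is enabled when its flag is 1.

```shell
# Hypothetical pre-flight helpers (not part of HCCL).

# Succeeds if VALUE is a non-negative integer within [MIN, MAX].
in_range() {  # in_range VALUE MIN MAX
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Validates the three values against the documented ranges.
check_retry_params() {  # check_retry_params MAXCNT HOLDTIME INTERVALTIME
    in_range "$1" 1 10    || { echo "MaxCnt must be in [1, 10]" >&2; return 1; }
    in_range "$2" 0 60000 || { echo "HoldTime must be in [0, 60000] ms" >&2; return 1; }
    in_range "$3" 0 60000 || { echo "IntervalTime must be in [0, 60000] ms" >&2; return 1; }
}

# Succeeds if at least one layer flag is set to 1 (assumed to mean enabled).
any_layer_enabled() {  # any_layer_enabled "L0:0, L1:1, L2:1"
    case "$1" in
        *:1*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Only export the parameters when they are valid and can take effect.
if any_layer_enabled "${HCCL_OP_RETRY_ENABLE:-}" && check_retry_params 3 5000 1000; then
    export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"
fi
```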