Enabling HCCL Communication Operator-Level Re-execution for Supernodes

Scenario

Optical modules in Snt9b23 supernodes have a relatively high failure rate. To improve system stability and reliability, a re-execution mechanism is introduced at the Huawei Collective Communication Library (HCCL) communication operator level.

HCCL is a distributed communication library designed by Huawei for Ascend AI processors. It enables efficient collaboration between multiple devices and accelerates distributed training of deep learning models in AI scenarios that require large-scale compute. In distributed training, HCCL coordinates data synchronization (such as gradient aggregation and parameter updates) between multiple Ascend processors, reducing communication overhead and improving training efficiency.

Constraints

Only Snt9b23 supernodes are supported.
Enabling operator re-execution has a slight impact on performance.
Re-execution depends on the VPC plane (non-parameter plane) network for status negotiation within the communication domain. If the VPC planes differ within the communication domain, re-execution cannot be performed.
On the HCCS plane, re-execution cannot be performed until the link has recovered and the route has converged.
Re-execution requires that all cards in a communication domain stop at the same communication operator when a fault occurs. Otherwise, re-execution cannot be performed. The success rate is about 95%.
Using a communication operator in inplace mode may cause the UserIn data to be overwritten, which affects the reliability of re-execution. About 80% of communication operators can be re-executed in inplace mode, but there are exceptions, for example, the all_reduce, all_gather, and reduce_scatter operators in the Torch framework.
For RoH/RoCE failover (lane borrowing) caused by intermittent disconnection or link disconnection, re-execution can be performed only once in the same communication domain, and switchback is not supported. Services can continue during the failover, but you should save checkpoints and rectify the faults promptly.

The following table lists the HCCL re-execution scope supported in each Ascend execution mode.

**Table 1** HCCL re-execution scope
Mode	HCCL Communication Operator Unfolding Mode	Supported
Single-operator	Stars	Supported
	Ffts+	Supported
	AI CPU unfolding	Supported
	Integrate communication and computing (mc2)	Not supported
Graph mode	Full POD mode, in which communication operators are integrated as expanded tasks.	Not supported. In full POD mode, HCCL is not involved in the graph execution process, so communication operators cannot be re-executed.
Graph mode	AI CPU unfolding	Supported

Principles

The connection system of the Snt9b23 supernode consists mainly of two transmission planes: the HCCS plane and the RoH/RoCE plane.

On the HCCS plane, optical interconnection is used between L1-1520 and L2-1520. On the RoH/RoCE plane, optical interconnection is used for parts beyond the NPU range. Because the fault rate of the electrical interconnection domain is relatively low, this mechanism mainly handles optical module faults in the optical interconnection domain, specifically:

Faulty optical module between L1-1520 and L2-1520 on the HCCS plane
Faulty optical module on the RoH/RoCE plane beyond the NPU range of the Snt9b23

HCCS plane

For the HCCS plane, if the optical module between L1 and L2 is intermittently or fully disconnected, the 1520 device automatically switches the path (provided that multiple paths exist). However, the link disconnection may cause packet loss and, in turn, service interruption, in which case the framework layer rolls back to the previous checkpoint for resumable training. With the HCCL re-execution mechanism, the failed communication operator can be re-executed after the 1520 device completes path switching. This reduces how often training has to roll back to a checkpoint and further improves service continuity and reliability.

RoH/RoCE plane

For the RoH/RoCE plane, the protocol has a built-in retransmission mechanism at the transport layer, which can recover from packet loss or intermittent disconnection. However, the reliability of this mechanism is limited. To enhance overall reliability, a re-execution mechanism is introduced at the HCCL layer. When an intermittent disconnection lasts for more than 30 seconds or a link disconnection occurs, the system establishes a new transmission path (lane borrowing) and starts the re-execution process at the operator level to keep services stable.

Parameter Configuration (HCCL_OP_RETRY_ENABLE)

The environment variable HCCL_OP_RETRY_ENABLE specifies whether to enable HCCL operator re-execution. Re-execution means that HCCL attempts to execute a communication operator again when that operator reports an SDMA or RDMA CQE error. This feature helps avoid communication interruption caused by intermittent hardware disconnection and improves communication stability.

The re-execution feature can be configured in the communication domains at the following physical layers:

L0: communication domain within a server
L1: communication domain between servers
L2: communication domain between supernodes

Configuration:

Before running a training job, run the following command on the server node:

export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"

**Table 2** Parameters
Parameter	Description	Value Range	Recommended Value
L0	Communication domain within a server	0: Re-execution is disabled for communication tasks in the communication domain within a server. 1: Re-execution is enabled for communication tasks in the communication domain within a server.	0
L1	Communication domain between servers	0: Re-execution is disabled for communication tasks in the communication domain between servers. 0 is the default value. 1: Re-execution is enabled for communication tasks in the communication domain between servers.	1
L2	Communication domain between supernodes	0: Re-execution is disabled for communication tasks in the communication domain between supernodes. 0 is the default value. 1: Re-execution is enabled for communication tasks in the communication domain between supernodes.	1

Note:

When L2 is set to 1, the communication between supernodes can be performed using the standby device NIC when the device NIC is faulty. The standby NIC is the NIC of the other die in the same NPU.
If the communication domain is created based on the ranktable, you need to configure the standby NIC using the backup device ip parameter in the ranktable file.
If the communication domain is created based on the root broadcast, the NICs of the two dies of the same NPU are automatically configured as each other's standby NICs. No manual configuration is required.
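
The per-layer switches described above can be assembled in a launch script instead of being typed by hand. The following is a minimal sketch, not part of the HCCL tooling: the function name build_retry_enable is hypothetical, and the only behavior taken from this topic is the value format "L0:x, L1:y, L2:z" with each flag being 0 or 1.

```shell
# Hypothetical helper (not provided by HCCL): build the value of
# HCCL_OP_RETRY_ENABLE from three 0/1 flags for layers L0, L1, and L2.
build_retry_enable() {
    if [ "$#" -ne 3 ]; then
        echo "usage: build_retry_enable <L0> <L1> <L2>" >&2
        return 1
    fi
    for flag in "$1" "$2" "$3"; do
        case "$flag" in
            0|1) ;;
            *) echo "invalid flag '$flag': expected 0 or 1" >&2; return 1 ;;
        esac
    done
    printf 'L0:%s, L1:%s, L2:%s\n' "$1" "$2" "$3"
}

# Recommended values from Table 2: L0 disabled, L1 and L2 enabled.
if value="$(build_retry_enable 0 1 1)"; then
    export HCCL_OP_RETRY_ENABLE="$value"
    echo "HCCL_OP_RETRY_ENABLE=${HCCL_OP_RETRY_ENABLE}"
fi
```

Rejecting anything other than 0 or 1 before the training job starts is a design choice of this sketch. How HCCL itself treats a malformed value is not described in this topic.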

Parameter Configuration (HCCL_OP_RETRY_PARAMS)

The environment variable HCCL_OP_RETRY_PARAMS configures the parameters for re-executing HCCL operators, including the maximum number of re-executions, the waiting time before the first re-execution, and the interval between two re-executions.

Configuration example

export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"

**Table 3** Parameters
Parameter	Description	Type	Value Range	Default Value	Unit	Recommended Value
MaxCnt	Maximum number of re-executions	uint32	[1, 10]	3	Count	Retain the default value 3.
HoldTime	Waiting time from the time when a communication operator execution failure is detected to the time when the operator is re-executed for the first time	uint32	[0, 60000]	5000	ms	Retain the default value 5000.
IntervalTime	Interval between two re-executions	uint32	[0, 60000]	1000	ms	Retain the default value 1000.
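
The value ranges in Table 3 can be checked before the variable is exported. The following is a minimal sketch under stated assumptions: the function names in_range, build_retry_params, and retry_wait_budget_ms are hypothetical, and the waiting-time estimate assumes that HoldTime elapses once before the first re-execution and IntervalTime elapses before each later one, with operator execution time excluded.

```shell
# Hypothetical helper: succeed only if $1 is a non-negative integer in [$2, $3].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Hypothetical helper: validate MaxCnt, HoldTime, and IntervalTime against
# the ranges in Table 3 and print the HCCL_OP_RETRY_PARAMS value.
build_retry_params() {
    in_range "$1" 1 10    || { echo "MaxCnt out of range [1, 10]: $1" >&2; return 1; }
    in_range "$2" 0 60000 || { echo "HoldTime out of range [0, 60000]: $2" >&2; return 1; }
    in_range "$3" 0 60000 || { echo "IntervalTime out of range [0, 60000]: $3" >&2; return 1; }
    printf 'MaxCnt:%s, HoldTime:%s, IntervalTime:%s\n' "$1" "$2" "$3"
}

# Assumption: total waiting time = HoldTime + (MaxCnt - 1) * IntervalTime,
# in milliseconds, not counting the time the operator itself takes to run.
retry_wait_budget_ms() {
    echo $(( $2 + ($1 - 1) * $3 ))
}

# Default values from Table 3.
if value="$(build_retry_params 3 5000 1000)"; then
    export HCCL_OP_RETRY_PARAMS="$value"
    echo "HCCL_OP_RETRY_PARAMS=${HCCL_OP_RETRY_PARAMS}"
    echo "estimated waiting time: $(retry_wait_budget_ms 3 5000 1000) ms"
fi
```

With the default values, the estimate is 5000 + 2 x 1000 = 7000 ms. Treat it as a rough planning figure for timeouts elsewhere in the job, not as documented HCCL behavior.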

Constraints:

This environment variable takes effect only when the HCCL re-execution feature is enabled (at any layer) using the HCCL_OP_RETRY_ENABLE environment variable.
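
Because of this dependency, a launch script may want to warn when HCCL_OP_RETRY_PARAMS is set but re-execution is disabled at every layer. The following is a minimal sketch: the function name retry_enabled_any_layer is hypothetical, and the check assumes the value uses the "L0:x, L1:y, L2:z" format shown earlier in this topic.

```shell
# Hypothetical helper: succeed if the given HCCL_OP_RETRY_ENABLE value
# enables re-execution (flag 1) at any of the layers L0, L1, or L2.
retry_enabled_any_layer() {
    case "$1" in
        *L0:1*|*L1:1*|*L2:1*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn if retry parameters are configured but no layer has re-execution on.
if [ -n "${HCCL_OP_RETRY_PARAMS:-}" ] && ! retry_enabled_any_layer "${HCCL_OP_RETRY_ENABLE:-}"; then
    echo "warning: HCCL_OP_RETRY_PARAMS is set, but HCCL_OP_RETRY_ENABLE does not enable any layer" >&2
fi
```

The check only inspects the environment of the shell that launches the job. It does not query HCCL, so it cannot confirm which settings the library actually applies.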
