Recovering a Training Job
Description
When creating a training job, you can enable fault tolerance checks by configuring automatic restarts. When a node failure occurs or the system detects an abnormality in the training job, the fault recovery mechanism is automatically triggered. This mechanism attempts to restore training services by restarting processes or rebuilding the job. The descriptions and differences of various recovery policies are shown in Table 1.
Constraints
- Fault tolerance recovery recovers training services by restarting processes or rebuilding jobs. To avoid training progress loss and compute waste, ensure that the code has been adapted to resumable training before enabling this function. For details, see Resumable Training.
- The constraints for different recovery policies are as follows:
Table 1 Comparison of recovery policies Recovery Policy
Description
Failure Scenario
Constraints
Restarts the training job within the original container without involving resource rescheduling.
- NPU chip failure self-healing (for example, HBM multi-bit ECC).
- Regular checkpoint (CKPT) saving is required.
- Supports resumable training.
- Retains the container environment from the time of failure; requires the job script to be re-entrant.
- No standby nodes required.
Terminates all Pods associated with the job and completely rebuilds the job instance when a job exits with a non-zero error code.
- Occasional software failures or network fluctuations.
- Regular CKPT saving is required.
- Supports resumable training.
- Operations must support interruption and exit with a non-zero error code upon failure.
- No standby nodes required.
Based on unconditional job-level rescheduling, if a node failure is detected during a job anomaly, the faulty node is isolated before rescheduling.
- All NPU chip failures.
- Node failures.
- Regular CKPT saving is required.
- Supports resumable training.
- Healthy standby nodes required.
Retains the existing job instance, isolates the faulty node, and only recreates the faulty Pods.
- All NPU chip failures.
- Node failures.
- Regular CKPT saving is required.
- Supports resumable training.
- Some instances retain the container environment from the time of failure; requires the job script to be re-entrant.
- Healthy standby nodes required.
If ModelArts detects a suspension state during job execution, it forcibly terminates user processes within the container. The container itself is not destroyed, and the training start command is re-executed. This does not involve resource rescheduling.
- Training job suspension detected by ModelArts.
- Regular CKPT saving is required.
- Supports resumable training.
- Retains the container environment from the time of failure; requires the job script to be re-entrant.
- No standby nodes required.
Fault Recovery Trigger Modes
- Console settings
When creating a training job, you can enable it based on Step 6: Configuring Fault Tolerance and Recovery.
Auto Restart: Includes in-place recovery and isolated job-level rescheduling. Note that in-place recovery does not consume Maximum Restarts.
Unconditional Auto Restart: Includes unconditional job-level rescheduling.
Restart Upon Suspension: Includes restarts triggered by a suspension state. Job suspension-triggered restarts do not consume Maximum Restarts.
Figure 1 Fault tolerance and recovery
- API settings
Auto restart: When creating a training job via the API, include the fault-tolerance/job-retry-num key within the annotations of the metadata field. Assign an integer between 1 and 128 to enable auto-restart and set the maximum retry attempts.
Unconditional auto restart: Assign a value to fault-tolerance/job-retry-num and set fault-tolerance/job-unconditional-retry to true.
Restart upon suspension: Assign a value to fault-tolerance/job-retry-num and set fault-tolerance/hang-retry to true.
Pod rescheduling: Assign values to both fault-tolerance/job-retry-num and fault-tolerance/pod-retry-num.
Parameter
Mandatory
Type
Description
annotations
No
Map<String,String>
Description: Advanced functions of a training job.
Constraints: The options are as follows:
{ "kind": "job", "metadata": { "annotations": { "fault-tolerance/job-retry-num": "8", "fault-tolerance/job-unconditional-retry": "true", "fault-tolerance/hang-retry": "true", "fault-tolerance/pod-retry-num": "3" } } }
Fault Recovery Environment Variables
The following environment variables can be used to determine whether a job has undergone resumable, primarily ensuring the reentrancy of training scripts.
| Variable | Description |
|---|---|
| MA_SCHEDULE_CNT | Represents the number of times the Pod where the task resides has been scheduled. For a newly submitted job, the initial value is 1. After each rescheduling recovery, MA_SCHEDULE_CNT increments by 1. Therefore, when MA_SCHEDULE_CNT > 1, it indicates that the current container has undergone at least one rescheduling recovery. |
| MA_PROC_START_CNT | Represents the number of times your script has been executed within the current Pod. When a new job is submitted or a job script starts after rescheduling recovery, MA_PROC_START_CNT is reset to 1. After each in-place recovery or restart upon suspension, this value increments by 1 when your script is executed again. Therefore, when MA_PROC_START_CNT > 1, it indicates that your script has already been executed in the current container. If the process involves shared memory or data loading, this variable can be used to determine whether to reopen shared memory or skip data loading. |
The following environment variable can be used to accelerate the speed of rescheduling.
| MA_FAILOVER_TERMINATION_GRACE_PERIOD_SECONDS | When set to a positive integer N, the volume unmounting step during the container termination phase will be set to asynchronous execution after N seconds. This effectively reduces the rescheduling recovery time for large-scale jobs. It is recommended to set this to 10 in scenarios primarily using SFS Turbo storage. Default value: -1 |
In-Place Recovery
During the operation of NPU training jobs, chip failures may occur. Some of these failures can be self-healed through system repair or resetting. For such self-healable chip failures, the system forcibly terminates the user processes within the container. The container itself is not destroyed, thus preserving the runtime environment. After all processes are terminated, the system attempts NPU chip self-healing. If the fault is cleared and the chip returns to normal, all containers will re-execute the training job's startup command. In-place recovery does not involve resource rescheduling; it simply restarts the training job within the original containers. The process is illustrated in this figure.
Trigger Scenarios
A self-healable fault occurs on the NPU chip.
Constraints
- The job must periodically save CKPTs.
- The job must support resumable training.
- In-place recovery preserves the container environment from the time of failure. This requires the training script to be reentrant. Typically, you need to skip data downloading and preprocessing steps, and delete and rebuild shared memory with the same name. You can use the environment variable MA_PROC_START_CNT to determine if an in-place recovery has occurred.
Degradation Policies
- If chip self-healing fails, the in-place recovery fails simultaneously. In this case, the node will be isolated, and the system will degrade to isolated job-level rescheduling.
- During a single training session, if the same fault code occurs 3 consecutive times within 24 hours on the same chip of the same node, the node will be isolated and the system will degrade to isolated job-level rescheduling.
- During a single training session, if the same fault code occurs 3 consecutive times within 24 hours on the same chip of the same node, and the fault is caused by your input (such as codes 80C98002 or 80CB8002), the node will not be isolated, and only rescheduling will be performed.
Self-healable Faults: Typically fault codes with Minor or Major severity levels. Warning levels do not require processing, while Critical levels cannot be self-healed.
User-Induced Faults: NPU failures caused by operator anomalies or input data. These usually require troubleshooting the CANN version, operator implementation, or data integrity.
Unconditional Job-Level Rescheduling
During the training process, unexpected situations may occur that lead to training failure and prevent the job from restarting in a timely manner, thereby extending the training cycle. Unconditional job-level rescheduling is designed to avoid such issues, improving the training success rate and job stability. When a job terminates abnormally with a non-zero exit code, Unconditional job-level rescheduling terminates all Pods associated with the job and completely rebuilds the job instance. The process is illustrated in this figure.
Trigger Scenarios
- Degradation caused by user-input-related faults during in-place recovery.
- Job failure or interruption with a non-zero exit code.
- Degradation resulting from a failed pod-level rescheduling attempt.
Constraints
- The job must periodically save CKPTs.
- The job must support resumable training.
- Operations must support interruption and exit with a non-zero error code upon failure. If the process cannot interrupt and remains in a running state during an anomaly, job-level rescheduling will not be triggered.
Degradation Policies
If the system triggers unconditional job rescheduling three times in a row and the issue persists on the fourth attempt with no clear node or chip faults, the system assumes the user's code has errors. Consequently, it stops the job and marks it as failed.
Isolated Job-Level Rescheduling
Isolated job-level rescheduling builds upon unconditional job-level rescheduling. If a node failure is detected when a job encounters an anomaly, the system will isolate the faulty node before performing rescheduling. The process is illustrated in this figure.
Trigger Scenarios
- Degradation caused by non-user-input factors during in-place recovery.
- Occurrence of a node failure.
Constraints
- The job must periodically save CKPTs.
- The job must support resumable training.
Degradation Policies
None.
Pod-Level Rescheduling
Compared to isolated job-level rescheduling, which deletes and rebuilds the entire job, pod-level rescheduling keeps the existing job instance. It isolates the faulty node first and only recreates the specific Pods that were affected by the failure.
Trigger Scenarios
- Degradation caused by non-user-input factors during in-place recovery.
- Occurrence of a node failure.
Constraints
- The job must periodically save CKPTs.
- The job must support resumable training.
- Pod-level rescheduling preserves the container environment of the healthy instances. This requires the training script to be reentrant. Typically, you need to skip data downloading or preprocessing steps, and delete and rebuild shared memory with the same name.
Degradation Policies
If pod-level rescheduling fails, the system will degrade to Unconditional Job-Level Rescheduling (since the faulty node has already been isolated).
Restart upon Suspension
If a stable training job runs properly for a while and then gets suspended without a hardware fault, you can restart it to fix the issue. However, because a suspended training job cannot automatically terminate its container, it cannot directly trigger job-level rescheduling; instead, restart upon suspension must be configured. When restart upon suspension is enabled, ModelArts monitors the job's status during runtime. Upon detecting a suspension, it forcibly terminates your processes within the container. The container itself is not destroyed, thus preserving the runtime environment. Once the processes are stopped, the system re-executes the training job's startup command. This does not involve resource rescheduling.
For details about the suspension detection rules, see Detecting Training Job Suspension. The process is illustrated in this figure.
Trigger Scenarios
- The system detects a suspension in an NPU or GPU training job.
Constraints
- The job must periodically save CKPTs.
- The job must support resumable training.
- Restart upon suspension preserves the container environment from the time of the incident. This requires the training script to be reentrant, which typically involves handling operation logic such as data downloading, data preprocessing, and the creation of shared memory with the same name.
Degradation Policies
If restart upon suspension is triggered 3 consecutive times, the system will automatically terminate the job and set it to a failed status.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot