Updated on 2026-04-28 GMT+08:00

Recovering a Training Job

Description

When creating a training job, you can enable fault tolerance checks by configuring automatic restarts. When a node failure occurs or the system detects an abnormality in the training job, the fault recovery mechanism is automatically triggered. This mechanism attempts to restore training services by restarting processes or rebuilding the job. The descriptions and differences of various recovery policies are shown in Table 1.

Constraints

  1. Fault recovery restores training services by restarting processes or rebuilding the job. To avoid losing training progress and wasting compute, adapt your code for resumable training before enabling this function. For details, see Resumable Training.
  2. The constraints for different recovery policies are as follows:
    Table 1 Comparison of recovery policies

    Recovery policy: In-place recovery
      Description: Restarts the training job within the original container without involving resource rescheduling.
      Failure scenario:
        • NPU chip failure self-healing (for example, HBM multi-bit ECC).
      Constraints:
        • Regular checkpoint (CKPT) saving is required.
        • Supports resumable training.
        • Retains the container environment from the time of failure; requires the job script to be re-entrant.
        • No standby nodes required.

    Recovery policy: Unconditional job-level rescheduling
      Description: Terminates all Pods associated with the job and completely rebuilds the job instance when a job exits with a non-zero error code.
      Failure scenario:
        • Occasional software failures or network fluctuations.
      Constraints:
        • Regular CKPT saving is required.
        • Supports resumable training.
        • Operations must support interruption and exit with a non-zero error code upon failure.
        • No standby nodes required.

    Recovery policy: Isolated job-level rescheduling
      Description: Based on unconditional job-level rescheduling, if a node failure is detected during a job anomaly, the faulty node is isolated before rescheduling.
      Failure scenario:
        • All NPU chip failures.
        • Node failures.
      Constraints:
        • Regular CKPT saving is required.
        • Supports resumable training.
        • Healthy standby nodes required.

    Recovery policy: Pod-level rescheduling
      Description: Retains the existing job instance, isolates the faulty node, and only recreates the faulty Pods.
      Failure scenario:
        • All NPU chip failures.
        • Node failures.
      Constraints:
        • Regular CKPT saving is required.
        • Supports resumable training.
        • Some instances retain the container environment from the time of failure; requires the job script to be re-entrant.
        • Healthy standby nodes required.

    Recovery policy: Restart upon suspension
      Description: If ModelArts detects a suspension state during job execution, it forcibly terminates user processes within the container. The container itself is not destroyed, and the training start command is re-executed. This does not involve resource rescheduling.
      Failure scenario:
        • Training job suspension detected by ModelArts.
      Constraints:
        • Regular CKPT saving is required.
        • Supports resumable training.
        • Retains the container environment from the time of failure; requires the job script to be re-entrant.
        • No standby nodes required.

Fault Recovery Trigger Modes

  • Console settings

    When creating a training job, you can enable fault recovery as described in Step 6: Configuring Fault Tolerance and Recovery.

    Auto Restart: Includes in-place recovery and isolated job-level rescheduling. Note that in-place recovery does not consume Maximum Restarts.

    Unconditional Auto Restart: Includes unconditional job-level rescheduling.

    Restart Upon Suspension: Includes restarts triggered by a suspension state. Job suspension-triggered restarts do not consume Maximum Restarts.

    Figure 1 Fault tolerance and recovery
  • API settings

    Auto restart: When creating a training job via the API, include the fault-tolerance/job-retry-num key within the annotations of the metadata field. Assign an integer between 1 and 128 to enable auto-restart and set the maximum retry attempts.

    Unconditional auto restart: Assign a value to fault-tolerance/job-retry-num and set fault-tolerance/job-unconditional-retry to true.

    Restart upon suspension: Assign a value to fault-tolerance/job-retry-num and set fault-tolerance/hang-retry to true.

    Pod rescheduling: Assign values to both fault-tolerance/job-retry-num and fault-tolerance/pod-retry-num.

    Parameter: annotations
      Mandatory: No
      Type: Map<String,String>
      Description: Advanced functions of a training job.
      Constraints: The options are as follows:
        • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
        • fault-tolerance/job-unconditional-retry: true (unconditional restart)
        • fault-tolerance/hang-retry: true (restart upon suspension)
        • fault-tolerance/pod-retry-num: 3 (number of pod reschedules)

    {
        "kind": "job",
        "metadata": {
            "annotations": {
                "fault-tolerance/job-retry-num": "8",
                "fault-tolerance/job-unconditional-retry": "true",
                "fault-tolerance/hang-retry": "true",
                "fault-tolerance/pod-retry-num": "3"
            }
        }
    }
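The annotation rules above can be sketched as a small helper that assembles the annotations map before a request is sent. This is a minimal sketch, not part of any ModelArts SDK: the function name build_fault_tolerance_annotations and its parameters are hypothetical, and only the annotation keys and the 1 to 128 range for fault-tolerance/job-retry-num come from this page. No range is checked for fault-tolerance/pod-retry-num because this page does not state one.

```python
import json


def build_fault_tolerance_annotations(job_retry_num, unconditional=False,
                                      hang_retry=False, pod_retry_num=None):
    """Return the annotations map (string keys and string values).

    Hypothetical helper: only the keys and the 1..128 range are from the page.
    """
    if isinstance(job_retry_num, bool) or not isinstance(job_retry_num, int):
        raise ValueError("job_retry_num must be an integer")
    if not 1 <= job_retry_num <= 128:
        raise ValueError("job_retry_num must be between 1 and 128")

    # The parameter type is Map<String,String>, so every value is a string.
    annotations = {"fault-tolerance/job-retry-num": str(job_retry_num)}
    if unconditional:
        annotations["fault-tolerance/job-unconditional-retry"] = "true"
    if hang_retry:
        annotations["fault-tolerance/hang-retry"] = "true"
    if pod_retry_num is not None:
        annotations["fault-tolerance/pod-retry-num"] = str(pod_retry_num)
    return annotations


# Rebuild the sample request fragment shown above.
body = {
    "kind": "job",
    "metadata": {
        "annotations": build_fault_tolerance_annotations(
            8, unconditional=True, hang_retry=True, pod_retry_num=3
        )
    },
}
print(json.dumps(body, indent=4))
```

Keeping the range check next to the key names makes an invalid retry count fail locally instead of at job submission.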

Fault Recovery Environment Variables

The following environment variables can be used to determine whether a job has undergone fault recovery, primarily to ensure that training scripts are reentrant.

Variable: MA_SCHEDULE_CNT

Description: The number of times the Pod running the task has been scheduled. The value is 1 for a newly submitted job and increments by 1 after each rescheduling recovery.

Therefore, MA_SCHEDULE_CNT > 1 indicates that the current container has undergone at least one rescheduling recovery.

Variable: MA_PROC_START_CNT

Description: The number of times your script has been executed within the current Pod. The value is reset to 1 when a new job is submitted or when the job script starts after a rescheduling recovery. It increments by 1 each time your script is executed again after an in-place recovery or a restart upon suspension.

Therefore, MA_PROC_START_CNT > 1 indicates that your script has already been executed in the current container. If the process involves shared memory or data loading, this variable can be used to determine whether to reopen shared memory or skip data loading.
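The two counters can be read at the start of a training script to decide which setup steps to repeat. The following is a minimal sketch: the function names are hypothetical, and falling back to 1 when a variable is unset or malformed (for example, when the script runs outside ModelArts) is an assumption, not documented behavior.

```python
import os


def _read_count(env, name):
    """Read a counter variable; assume 1 (first run) if unset or malformed."""
    try:
        return int(env.get(name, "1"))
    except ValueError:
        return 1


def recovery_state(env=None):
    """Summarize whether this run follows a rescheduling or an in-place restart."""
    env = os.environ if env is None else env
    schedule_cnt = _read_count(env, "MA_SCHEDULE_CNT")
    proc_start_cnt = _read_count(env, "MA_PROC_START_CNT")
    return {
        # The Pod has been scheduled more than once: new container, old job.
        "rescheduled": schedule_cnt > 1,
        # The script already ran in this container: files and shared memory
        # from the previous run may still exist.
        "restarted_in_place": proc_start_cnt > 1,
    }


state = recovery_state()
if state["restarted_in_place"]:
    print("Script re-entered in the same container; reuse local data.")
elif state["rescheduled"]:
    print("Running in a new container after rescheduling; resume from CKPT.")
else:
    print("First run of a newly submitted job.")
```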

The following environment variable can be used to speed up rescheduling.

Variable: MA_FAILOVER_TERMINATION_GRACE_PERIOD_SECONDS

Description: When set to a positive integer N, volume unmounting during container termination switches to asynchronous execution after N seconds. This reduces the rescheduling recovery time for large-scale jobs. Setting this to 10 is recommended in scenarios that primarily use SFS Turbo storage.

Default value: -1

In-Place Recovery

While an NPU training job is running, chip failures may occur. Some of these failures can be self-healed through system repair or resetting. For such self-healable chip failures, the system forcibly terminates the user processes within the container. The container itself is not destroyed, which preserves the runtime environment. After all processes are terminated, the system attempts NPU chip self-healing. If the fault is cleared and the chip returns to normal, all containers re-execute the training job's startup command. In-place recovery does not involve resource rescheduling; it simply restarts the training job within the original containers. The process is illustrated in Figure 2.

Figure 2 In-place recovery

Trigger Scenarios

A self-healable fault occurs on the NPU chip.

Constraints

  • The job must periodically save CKPTs.
  • The job must support resumable training.
  • In-place recovery preserves the container environment from the time of failure. This requires the training script to be reentrant. Typically, you need to skip data downloading and preprocessing steps, and delete and rebuild shared memory with the same name. You can use the environment variable MA_PROC_START_CNT to determine if an in-place recovery has occurred.
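Because the container survives an in-place recovery, a shared memory segment created by the previous run may still exist under the same name. The following is a minimal sketch of the "delete and rebuild" step using Python's standard multiprocessing.shared_memory module on a POSIX system. The function name and the segment name ma_demo_cache are hypothetical.

```python
from multiprocessing import shared_memory


def rebuild_shared_memory(name, size):
    """Remove any leftover segment called `name`, then create a fresh one."""
    try:
        stale = shared_memory.SharedMemory(name=name)  # attach if it exists
    except FileNotFoundError:
        pass  # first run in this container: nothing to clean up
    else:
        stale.close()
        stale.unlink()  # delete the segment left by the previous run
    return shared_memory.SharedMemory(name=name, create=True, size=size)


# "ma_demo_cache" is a placeholder name for illustration only.
segment = rebuild_shared_memory("ma_demo_cache", 1024)
try:
    segment.buf[0] = 1  # use the fresh segment
finally:
    segment.close()
    segment.unlink()
```

Calling the helper unconditionally keeps the script reentrant without checking MA_PROC_START_CNT, since the cleanup branch is simply skipped on the first run.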

Degradation Policies

  • If chip self-healing fails, in-place recovery also fails. In this case, the node is isolated and the system degrades to isolated job-level rescheduling.
  • During a single training session, if the same fault code occurs 3 consecutive times within 24 hours on the same chip of the same node, the node is isolated and the system degrades to isolated job-level rescheduling.
  • During a single training session, if the same fault code occurs 3 consecutive times within 24 hours on the same chip of the same node, and the fault is caused by your input (such as codes 80C98002 or 80CB8002), the node is not isolated and the system degrades to unconditional job-level rescheduling.

Self-healable Faults: Typically fault codes with Minor or Major severity levels. Warning levels do not require processing, while Critical levels cannot be self-healed.

User-Induced Faults: NPU failures caused by operator anomalies or input data. These usually require troubleshooting the CANN version, operator implementation, or data integrity.
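The degradation policies above can be read as a small decision procedure. The following sketch is one interpretation of those bullets, not the platform's implementation: the function names, the action labels, the event format, and the order in which the checks are applied are all assumptions. The two fault codes are the examples given above and are not an exhaustive list.

```python
from datetime import datetime, timedelta

USER_INDUCED_CODES = {"80C98002", "80CB8002"}  # examples from the text only
WINDOW = timedelta(hours=24)
THRESHOLD = 3


def consecutive_same_fault(events, now):
    """Count trailing events with the latest fault code inside the window.

    `events` is a list of (timestamp, fault_code) for one chip, oldest first.
    """
    if not events:
        return 0
    latest_code = events[-1][1]
    count = 0
    for timestamp, code in reversed(events):
        if code != latest_code or now - timestamp > WINDOW:
            break
        count += 1
    return count


def next_action(events, now, self_heal_ok):
    """Pick the recovery action for the latest fault on one chip."""
    if not self_heal_ok:
        return "ISOLATE_AND_RESCHEDULE_JOB"
    if consecutive_same_fault(events, now) >= THRESHOLD:
        if events[-1][1] in USER_INDUCED_CODES:
            return "RESCHEDULE_WITHOUT_ISOLATION"
        return "ISOLATE_AND_RESCHEDULE_JOB"
    return "RECOVER_IN_PLACE"


now = datetime(2026, 1, 2, 12, 0)
history = [(now - timedelta(hours=h), "80C98002") for h in (3, 2, 1)]
print(next_action(history, now, self_heal_ok=True))
```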

Unconditional Job-Level Rescheduling

During training, unexpected situations may cause the job to fail and prevent it from restarting in a timely manner, which extends the training cycle. Unconditional job-level rescheduling is designed to avoid such issues, improving the training success rate and job stability. When a job terminates abnormally with a non-zero exit code, unconditional job-level rescheduling terminates all Pods associated with the job and completely rebuilds the job instance. The process is illustrated in Figure 3.

Figure 3 Unconditional job-level rescheduling

Trigger Scenarios

  • Degradation caused by user-input-related faults during in-place recovery.
  • Job failure or interruption with a non-zero exit code.
  • Degradation resulting from a failed pod-level rescheduling attempt.

Constraints

  • The job must periodically save CKPTs.
  • The job must support resumable training.
  • Operations must support interruption and exit with a non-zero error code upon failure. If the process cannot interrupt and remains in a running state during an anomaly, job-level rescheduling will not be triggered.
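The exit-code requirement can be met by wrapping the training entry point so that any unhandled error becomes a non-zero process exit code instead of leaving the process running. This is a minimal sketch: run_training and train are hypothetical names, and the specific codes 1 and 130 are conventional choices, not values required by ModelArts.

```python
import traceback


def run_training(train_fn):
    """Run `train_fn` and map its outcome to a process exit code."""
    try:
        train_fn()
    except KeyboardInterrupt:
        return 130  # conventional code for an interrupted process
    except Exception:
        traceback.print_exc()  # keep the stack trace in the job log
        return 1  # any non-zero code marks the run as failed
    return 0


def train():
    """Placeholder for the real training loop."""
    pass


exit_code = run_training(train)
print("exit code:", exit_code)
```

In a launcher script, passing the returned code to sys.exit makes the failure visible to the platform.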

Degradation Policies

If the system triggers unconditional job rescheduling three times in a row and the issue persists on the fourth attempt with no clear node or chip faults, the system assumes the user's code has errors. Consequently, it stops the job and marks it as failed.

Isolated Job-Level Rescheduling

Isolated job-level rescheduling builds upon unconditional job-level rescheduling. If a node failure is detected when a job encounters an anomaly, the system isolates the faulty node before rescheduling. The process is illustrated in Figure 4.

Figure 4 Isolated job-level rescheduling

Trigger Scenarios

  • Degradation caused by non-user-input factors during in-place recovery.
  • Occurrence of a node failure.

Constraints

  • The job must periodically save CKPTs.
  • The job must support resumable training.
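After a job-level rescheduling the containers are rebuilt, so training progress survives only through checkpoints. The following is a framework-agnostic sketch of periodic saving and resuming. The file layout (step_XXXXXXXX.json), the function names, and the JSON state are illustrative assumptions; a real job would use its framework's checkpoint API and write to persistent storage.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
    os.replace(tmp_path, final_path)  # atomic rename within one directory
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint, or None if there is nothing to resume."""
    if not ckpt_dir.is_dir():
        return None
    files = sorted(ckpt_dir.glob("step_*.json"))  # zero-padded names sort by step
    if not files:
        return None
    with files[-1].open() as handle:
        return json.load(handle)


def train(ckpt_dir, total_steps, save_every):
    """Toy loop that resumes after the last saved step; returns the start step."""
    checkpoint = load_latest_checkpoint(ckpt_dir)
    start = checkpoint["step"] + 1 if checkpoint else 1
    for step in range(start, total_steps + 1):
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return start
```

The save interval bounds how much work is repeated after a restart: at most save_every - 1 steps are redone.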

Degradation Policies

None.

Pod-Level Rescheduling

Compared to isolated job-level rescheduling, which deletes and rebuilds the entire job, pod-level rescheduling keeps the existing job instance. It isolates the faulty node first and only recreates the specific Pods that were affected by the failure.

Figure 5 Pod-level rescheduling

Trigger Scenarios

  • Degradation caused by non-user-input factors during in-place recovery.
  • Occurrence of a node failure.

Constraints

  • The job must periodically save CKPTs.
  • The job must support resumable training.
  • Pod-level rescheduling preserves the container environment of the healthy instances. This requires the training script to be reentrant. Typically, you need to skip data downloading or preprocessing steps, and delete and rebuild shared memory with the same name.

Degradation Policies

If pod-level rescheduling fails, the system will degrade to unconditional job-level rescheduling (since the faulty node has already been isolated).

Restart upon Suspension

If a stable training job runs properly for a while and then gets suspended without a hardware fault, you can restart it to fix the issue. However, because a suspended training job cannot automatically terminate its container, it cannot directly trigger job-level rescheduling; instead, restart upon suspension must be configured. When restart upon suspension is enabled, ModelArts monitors the job's status during runtime. Upon detecting a suspension, it forcibly terminates your processes within the container. The container itself is not destroyed, thus preserving the runtime environment. Once the processes are stopped, the system re-executes the training job's startup command. This does not involve resource rescheduling.

For details about the suspension detection rules, see Detecting Training Job Suspension. The process is illustrated in Figure 6.

Figure 6 Restart upon suspension

Trigger Scenarios

  • The system detects a suspension in an NPU or GPU training job.

Constraints

  • The job must periodically save CKPTs.
  • The job must support resumable training.
  • Restart upon suspension preserves the container environment from the time of the suspension. This requires the training script to be reentrant. Typically, you need to skip data downloading and preprocessing steps, and delete and rebuild shared memory with the same name.
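One common way to make data downloading and preprocessing safe to re-run in a preserved container is a completion marker: the step runs once, and later executions of the script skip it. The following is a minimal sketch. The function name prepare_data_once, the marker file name .prepared, and the download callback are hypothetical.

```python
import tempfile
from pathlib import Path


def prepare_data_once(data_dir, prepare):
    """Run `prepare(data_dir)` only if no earlier run in this container finished it.

    Returns True if preparation ran, False if it was skipped.
    """
    marker = data_dir / ".prepared"
    if marker.exists():
        return False  # an earlier run already completed this step
    data_dir.mkdir(parents=True, exist_ok=True)
    prepare(data_dir)
    marker.touch()  # written last, so an interrupted run is retried
    return True


def download(target_dir):
    """Placeholder for the real download and preprocessing logic."""
    (target_dir / "train.txt").write_text("sample\n")


with tempfile.TemporaryDirectory() as workdir:
    data_dir = Path(workdir) / "dataset"
    print(prepare_data_once(data_dir, download))  # first execution
    print(prepare_data_once(data_dir, download))  # re-entry after a restart
```

Writing the marker only after preparation succeeds means a restart that interrupted the download repeats it rather than training on partial data.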

Degradation Policies

If restart upon suspension is triggered 3 consecutive times, the system will automatically terminate the job and set it to a failed status.