Updated on 2025-08-18 GMT+08:00

Training Job Restart Upon Suspension

If a training job that has been running stably is suspended and no hardware fault is involved, restarting the job can resolve the issue. During a suspension, however, the container is not stopped automatically, so auto restart alone does not apply; you must configure the job to restart upon suspension. With this setting, ModelArts forcibly stops the user process in the container as soon as a suspension is detected and then reruns the training job boot command. No resource scheduling is involved; the job restarts in the original container.

For details about the suspension detection rules, see Detecting Training Job Suspension.

To avoid losing training progress, ensure your code can resume training from the point where it was interrupted, and then enable unconditional auto restart to make full use of compute resources. For details, see Resumable Training.
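The exact resume logic depends on your framework and code. The following is a minimal PyTorch-style sketch of checkpoint-based resumption; the checkpoint path, model, and optimizer objects are illustrative assumptions, not ModelArts-specific settings.

    import os
    import torch

    # Illustrative checkpoint path (assumption); in a real job this would point to a
    # persistent output location that survives a restart.
    CKPT_PATH = "/tmp/checkpoint.pt"

    def save_checkpoint(model, optimizer, epoch):
        # Persist everything needed to resume: weights, optimizer state, and progress.
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        }, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        # If a checkpoint exists (for example, after a restart upon suspension),
        # resume from it instead of starting over.
        if os.path.exists(CKPT_PATH):
            ckpt = torch.load(CKPT_PATH, map_location="cpu")
            model.load_state_dict(ckpt["model_state"])
            optimizer.load_state_dict(ckpt["optimizer_state"])
            return ckpt["epoch"] + 1  # next epoch to run
        return 0  # no checkpoint yet: train from scratch

    # Typical use in the training loop:
    # start_epoch = load_checkpoint(model, optimizer)
    # for epoch in range(start_epoch, num_epochs):
    #     train_one_epoch(model, optimizer)
    #     save_checkpoint(model, optimizer, epoch)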

If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.

Notes and Constraints

To prevent compute waste from invalid restarts, training jobs can restart upon suspension a maximum of three consecutive times.

Enabling Restart Upon Suspension

To enable restart upon suspension, use either the console or the API.

  • Using the console

    On the Create Training Job page, enable auto restart and then restart upon suspension. When a suspension is detected, the system restarts the training job in the original container. This process does not involve resource scheduling, so the Restarts value is not affected.

    Figure 1 Enabling restart upon suspension
  • Using an API

    When creating a training job through an API, pass the fault-tolerance/job-retry-num and fault-tolerance/hang-retry fields in the annotations of the metadata field. To enable auto restart and set the number of restarts, set fault-tolerance/job-retry-num to an integer from 1 to 128. To enable restart upon suspension, set fault-tolerance/hang-retry to true. Table 1 describes the related parameters, followed by a sample request body.

    Table 1 Parameters

    kind

      Mandatory: Yes
      Type: String
      Description: Type of a training job.
      Constraints: N/A
      Options:
      • job: common job
      • edge_job: edge job
      • hetero_job: heterogeneous job
      • mrs_job: MRS job
      • autosearch_job: auto search job
      • diag_job: diagnosis job
      • visualization_job: visualization job
      Default value: job

    annotations

      Mandatory: No
      Type: Map<String,String>
      Description: Advanced functions of a training job.
      Constraints: The options are as follows:
      • job_template: Template RL (heterogeneous job)
      • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
      • fault-tolerance/job-unconditional-retry: true (unconditional restart)
      • fault-tolerance/hang-retry: true (restart upon suspension)
      • jupyter-lab/enable: true (JupyterLab training application)
      • tensorboard/enable: true (TensorBoard training application)
      • mindstudio-insight/enable: true (MindStudio Insight training application)

    {
        "kind": "job",
        "metadata": {
            "annotations": {
                "fault-tolerance/job-retry-num": "3",
                "fault-tolerance/hang-retry": "true"
            }
        }
    }
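
    The request body above shows only the fields related to restart behavior. As a rough sketch (not an authoritative example), such a body could be submitted with Python's requests library. The endpoint URL, project ID, token, job name, and the omitted required fields (such as the algorithm and resource specification) are placeholders to replace according to the ModelArts API reference; the /v2/{project_id}/training-jobs path is assumed to be the v2 training job creation API.

    import requests

    # Placeholder values (assumptions); replace with your region endpoint,
    # project ID, and a valid IAM token.
    ENDPOINT = "https://modelarts.<region>.myhuaweicloud.com"
    PROJECT_ID = "<project_id>"
    TOKEN = "<iam_token>"

    body = {
        "kind": "job",
        "metadata": {
            "name": "demo-job",  # hypothetical job name
            "annotations": {
                "fault-tolerance/job-retry-num": "3",   # auto restart, up to 3 retries
                "fault-tolerance/hang-retry": "true"    # restart upon suspension
            }
        }
        # ... algorithm, spec, and other required fields omitted for brevity ...
    }

    resp = requests.post(
        f"{ENDPOINT}/v2/{PROJECT_ID}/training-jobs",
        json=body,
        headers={"X-Auth-Token": TOKEN},
    )
    print(resp.status_code, resp.text)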