Updated on 2025-08-18 GMT+08:00

Training Job Restart Upon Suspension

If a training job that has been running stably is suspended and no hardware fault is involved, restarting the job can resolve the issue. During a suspension, however, the container is not stopped automatically, so auto restart alone does not apply; you must configure the job to restart upon suspension. With this setting, ModelArts forcibly stops the user process in the container as soon as a suspension is detected and then reruns the training job boot command. No resource scheduling is involved; the job restarts in the original container.

For details about the suspension detection rules, see Detecting Training Job Suspension.

To avoid losing training progress, ensure your code can resume training from the point where it was interrupted, and then enable unconditional auto restart to make full use of compute resources. For details, see Resumable Training.
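The exact resume logic depends on your framework and code. The following is a minimal PyTorch-style sketch of checkpoint-based resumption; the checkpoint path, model, and optimizer objects are illustrative assumptions, not ModelArts-specific settings.

    import os
    import torch

    # Illustrative checkpoint path (assumption); in a real job this would point to a
    # persistent output location that survives a restart.
    CKPT_PATH = "/tmp/checkpoint.pt"

    def save_checkpoint(model, optimizer, epoch):
        # Persist everything needed to resume: weights, optimizer state, and progress.
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        }, CKPT_PATH)

    def load_checkpoint(model, optimizer):
        # If a checkpoint exists (for example, after a restart upon suspension),
        # resume from it instead of starting over.
        if os.path.exists(CKPT_PATH):
            ckpt = torch.load(CKPT_PATH, map_location="cpu")
            model.load_state_dict(ckpt["model_state"])
            optimizer.load_state_dict(ckpt["optimizer_state"])
            return ckpt["epoch"] + 1  # next epoch to run
        return 0  # no checkpoint yet: train from scratch

    # Typical use in the training loop:
    # start_epoch = load_checkpoint(model, optimizer)
    # for epoch in range(start_epoch, num_epochs):
    #     train_one_epoch(model, optimizer)
    #     save_checkpoint(model, optimizer, epoch)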

If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.

Notes and Constraints

To prevent compute waste from invalid restarts, training jobs can restart upon suspension a maximum of three consecutive times.

Enabling Restart Upon Suspension

To enable restart upon suspension, use either the console or the API.

  • Using the console

    On the Create Training Job page, enable auto restart and then restart upon suspension. When a suspension is detected, the system restarts the training job in the original container. This process does not involve resource scheduling, so the Restarts value is not affected.

    Figure 1 Enabling restart upon suspension
  • Using an API

    When creating a training job through an API, pass the fault-tolerance/job-retry-num and fault-tolerance/hang-retry fields in the annotations of the metadata field. To enable auto restart and set the number of restarts, set fault-tolerance/job-retry-num to an integer from 1 to 128. To enable restart upon suspension, set fault-tolerance/hang-retry to true. Table 1 describes the related parameters, followed by a sample request body.

    Table 1 Parameters

    kind

      Mandatory: Yes
      Type: String
      Description: Type of a training job.
      Constraints: N/A
      Options:
      • job: common job
      • edge_job: edge job
      • hetero_job: heterogeneous job
      • mrs_job: MRS job
      • autosearch_job: auto search job
      • diag_job: diagnosis job
      • visualization_job: visualization job
      Default value: job

    annotations

      Mandatory: No
      Type: Map<String,String>
      Description: Advanced functions of a training job.
      Constraints: The options are as follows:
      • job_template: Template RL (heterogeneous job)
      • fault-tolerance/job-retry-num: 3 (number of retries upon a fault)
      • fault-tolerance/job-unconditional-retry: true (unconditional restart)
      • fault-tolerance/hang-retry: true (restart upon suspension)
      • jupyter-lab/enable: true (JupyterLab training application)
      • tensorboard/enable: true (TensorBoard training application)
      • mindstudio-insight/enable: true (MindStudio Insight training application)

    {
        "kind": "job",
        "metadata": {
            "annotations": {
                "fault-tolerance/job-retry-num": "3",
                "fault-tolerance/hang-retry": "true"
            }
        }
    }
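
    The request body above shows only the fields related to restart behavior. As a rough sketch (not an authoritative example), such a body could be submitted with Python's requests library. The endpoint URL, project ID, token, job name, and the omitted required fields (such as the algorithm and resource specification) are placeholders to replace according to the ModelArts API reference; the /v2/{project_id}/training-jobs path is assumed to be the v2 training job creation API.

    import requests

    # Placeholder values (assumptions); replace with your region endpoint,
    # project ID, and a valid IAM token.
    ENDPOINT = "https://modelarts.<region>.myhuaweicloud.com"
    PROJECT_ID = "<project_id>"
    TOKEN = "<iam_token>"

    body = {
        "kind": "job",
        "metadata": {
            "name": "demo-job",  # hypothetical job name
            "annotations": {
                "fault-tolerance/job-retry-num": "3",   # auto restart, up to 3 retries
                "fault-tolerance/hang-retry": "true"    # restart upon suspension
            }
        }
        # ... algorithm, spec, and other required fields omitted for brevity ...
    }

    resp = requests.post(
        f"{ENDPOINT}/v2/{PROJECT_ID}/training-jobs",
        json=body,
        headers={"X-Auth-Token": TOKEN},
    )
    print(resp.status_code, resp.text)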