Updated on 2024-10-29 GMT+08:00

Enabling Unconditional Auto Restart

Context

Unconditional auto restart helps prevent training failures and delays. When it is enabled, the system automatically restarts a failed job regardless of the cause, which improves training success rates and job stability. To prevent invalid restarts, the system allows at most three consecutive unconditional restarts.

To avoid losing training progress, make sure your code can resume training from where it was interrupted, and then enable unconditional auto restart to optimize compute usage. For details, see Resumable Training.
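The following is a minimal, framework-agnostic sketch of what resuming from an interruption can look like. It is an illustration only, not the method described in Resumable Training: the function names, the JSON checkpoint format, and the `fail_at` parameter (used here to simulate a crash) are all assumptions. A real job would typically save model and optimizer state to persistent storage that survives a restart.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def train(path, total_epochs, fail_at=None):
    """Run a placeholder training loop that starts at the last saved epoch.

    fail_at is a hypothetical knob that raises an error at the given epoch
    to imitate a job failure.
    """
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated training failure")
        # Placeholder for one epoch of real training work.
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        save_checkpoint(path, state)
    return state
```

With this structure, a restarted job that calls `train` again with the same checkpoint path continues from the last completed epoch instead of starting over.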

If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Rescheduling.

Procedure

You can enable unconditional auto restart either on the console or through an API.

  • Using the console

    On the training job creation page, enable Auto Restart and select Unconditional auto restart. With Unconditional auto restart selected, the training job is restarted whenever the system detects a training exception, regardless of the cause. If you enable Auto Restart without selecting Unconditional auto restart, the training job is restarted automatically only when it encounters environmental issues. For any other problem, the status of the training job changes to Failed.

    Figure 1 Enabling unconditional auto restart

  • Using an API

    When creating a training job through an API, set the fault-tolerance/job-retry-num and fault-tolerance/job-unconditional-retry fields under annotations in the metadata field. To enable auto restart, set fault-tolerance/job-retry-num to a value from 1 to 128. To enable unconditional auto restart, also set fault-tolerance/job-unconditional-retry to true. For example:

    {
        "kind": "job",
        "metadata": {
            "annotations": {
                "fault-tolerance/job-retry-num": "8",
                "fault-tolerance/job-unconditional-retry": "true"
            }
        }
    }
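As a rough sketch, the annotations above could be assembled and range-checked in code before the request is sent. The helper names below are hypothetical, and the body contains only the fields shown in this section. A real job creation request needs additional fields that are not covered here. Values are emitted as strings to match the sample.

```python
import json

RETRY_NUM_KEY = "fault-tolerance/job-retry-num"
UNCONDITIONAL_KEY = "fault-tolerance/job-unconditional-retry"


def build_fault_tolerance_annotations(retry_num, unconditional=True):
    """Return the annotations dict, rejecting retry counts outside 1 to 128."""
    if isinstance(retry_num, bool) or not isinstance(retry_num, int):
        raise ValueError("retry_num must be an integer")
    if not 1 <= retry_num <= 128:
        raise ValueError("retry_num must be between 1 and 128")
    annotations = {RETRY_NUM_KEY: str(retry_num)}
    if unconditional:
        annotations[UNCONDITIONAL_KEY] = "true"
    return annotations


def build_job_body(retry_num, unconditional=True):
    """Return a partial request body holding only the fields from this section."""
    return {
        "kind": "job",
        "metadata": {
            "annotations": build_fault_tolerance_annotations(retry_num, unconditional)
        },
    }


if __name__ == "__main__":
    # Prints a body equivalent to the JSON sample above.
    print(json.dumps(build_job_body(8), indent=4))
```

Validating the range locally is a design choice for catching mistakes early. It does not replace whatever validation the service itself performs.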