Updated on 2024-05-23 GMT+08:00

Unconditional Auto Restart

Context

Unexpected faults during training can cause a job to fail, and any delay in restarting it lengthens the overall training time. Unconditional auto restart addresses this: the system automatically restarts a failed training job regardless of the cause, which improves the training success rate and job stability. To prevent futile restarts, a job can be restarted unconditionally at most three consecutive times.

To avoid losing training progress and to make full use of compute power, ensure that your training code supports resumable training before enabling this feature. For details, see Resumable Training and Incremental Training.
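The following is a minimal, framework-independent sketch of what resumable training logic can look like. It is not the platform's implementation: the function names, the JSON checkpoint format, the checkpoint path, and the fail_at_epoch parameter (used only to simulate a failure) are all assumptions made for illustration. The idea is that the job saves its state after each epoch and, after a restart, continues from the last saved epoch instead of starting over.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"epoch": 0, "weight": 0.0}


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it over the checkpoint,
    so an interruption during the write does not leave a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_epochs, fail_at_epoch=None):
    """Run training from the last completed epoch recorded in the checkpoint.

    fail_at_epoch is a hypothetical hook that simulates a crash for testing.
    """
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise RuntimeError("simulated training failure")
        state["weight"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

With logic of this kind, a job that fails partway through and is restarted picks up at the first unfinished epoch, so the work completed before the failure is not repeated. In a real job, the checkpoint would need to be written to storage that persists across restarts.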

If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Viewing Fault Recovery Details.

Enabling Unconditional Auto Restart

You can enable unconditional auto restart either on the console or through an API.

  • Using the console

    On the Create Training Job page, enable Auto Restart and select Unconditional auto restart. With this option selected, the training job is restarted whenever the system detects a training exception, regardless of the cause. If you enable Auto Restart without selecting Unconditional auto restart, the job is restarted automatically only when it fails because of an environment issue. For any other failure, the job status changes to Failed.

    Figure 1 Enabling unconditional auto restart
  • Using an API

    When creating a training job through an API, add the fault-tolerance/job-retry-num and fault-tolerance/job-unconditional-retry fields to annotations in the metadata field. To enable auto restart, set fault-tolerance/job-retry-num to a value from 1 to 128. To enable unconditional auto restart, also set fault-tolerance/job-unconditional-retry to true. The following example passes both values as strings.

    {
        "kind": "job",
        "metadata": {
            "annotations": {
                "fault-tolerance/job-retry-num": "8",
                "fault-tolerance/job-unconditional-retry": "true"
            }
        }
    }
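If the request body is assembled in code, the annotations can be generated and range-checked before the request is sent. The sketch below is an illustration under stated assumptions, not part of any SDK: the helper names build_fault_tolerance_annotations and build_job_body are hypothetical, and leaving out the unconditional-retry annotation when it is not wanted is an assumption (the source only documents the value true). It produces only the metadata fragment shown above; a real job creation request requires additional fields that are not covered here.

```python
import json


def build_fault_tolerance_annotations(retry_num, unconditional=False):
    """Build the auto restart annotations (hypothetical helper).

    retry_num must be within 1..128, the range stated in this section.
    Values are emitted as strings, matching the JSON example.
    """
    if isinstance(retry_num, bool) or not isinstance(retry_num, int):
        raise TypeError("retry_num must be an integer")
    if not 1 <= retry_num <= 128:
        raise ValueError("retry_num must be between 1 and 128")

    annotations = {"fault-tolerance/job-retry-num": str(retry_num)}
    if unconditional:
        # Assumption: the key is simply left out when unconditional
        # restart is not requested.
        annotations["fault-tolerance/job-unconditional-retry"] = "true"
    return annotations


def build_job_body(retry_num, unconditional=False):
    """Return only the metadata fragment of a job creation request."""
    return {
        "kind": "job",
        "metadata": {
            "annotations": build_fault_tolerance_annotations(
                retry_num, unconditional
            )
        },
    }


if __name__ == "__main__":
    # Produces the same fields as the JSON example in this section.
    print(json.dumps(build_job_body(8, unconditional=True), indent=4))
```

Validating the range locally surfaces an out-of-range retry count as a ValueError before any request is made.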