Enabling Unconditional Auto Restart
Context
To reduce the impact of training failures and the delays they cause, use unconditional auto restart. This feature automatically restarts a failed job regardless of the cause, improving training success rates and job stability. To avoid repeated restarts that cannot succeed, the system limits unconditional restarts to three consecutive attempts.
To avoid losing training progress, ensure that your code can resume training from the point where it was interrupted before you enable unconditional auto restart. This also avoids wasting compute resources on repeated work. For details, see Resumable Training.
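The resume requirement above can be illustrated with a minimal sketch. It is not the platform's Resumable Training implementation: the checkpoint file, the state layout, and the function names (load_checkpoint, save_checkpoint, train) are all hypothetical, and the "training step" is a placeholder computation. It only shows the pattern of saving state after each step and picking up from the last saved step after a restart.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the checkpoint.

    os.replace is atomic, so a crash mid-write cannot leave a corrupt file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_steps, fail_at=None):
    """Run a toy training loop that resumes from the last checkpointed step.

    fail_at is a hypothetical hook that simulates a job failure at that step.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state["loss_sum"] += 1.0 / (step + 1)  # placeholder for a real training step
        state["step"] = step + 1
        save_checkpoint(path, state)
    return state
```

In a real job, the checkpoint would normally hold model and optimizer state and be written to storage that survives the restart. Saving after every step is used here only to keep the sketch short.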
If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Rescheduling.
Procedure
You can enable unconditional auto restart either on the console or through an API.
- Using the console
On the training job creation page, enable Auto Restart and select Unconditional auto restart. If Unconditional auto restart is enabled, the training job is restarted whenever the system detects a training exception, regardless of the cause. If you enable Auto Restart but do not select Unconditional auto restart, the training job restarts automatically only when it encounters environment issues. For any other problem, the job status changes to Failed.
Figure 1 Enabling unconditional auto restart
- Using an API
When creating a training job through an API, specify the fault-tolerance/job-retry-num and fault-tolerance/job-unconditional-retry fields in annotations of the metadata field. To enable auto restart, set fault-tolerance/job-retry-num to a value from 1 to 128. To enable unconditional auto restart, also set fault-tolerance/job-unconditional-retry to true. For example:
{
    "kind": "job",
    "metadata": {
        "annotations": {
            "fault-tolerance/job-retry-num": "8",
            "fault-tolerance/job-unconditional-retry": "true"
        }
    }
}
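If the request body is assembled in code, the annotations can be built and range-checked before the request is sent. The following is a minimal sketch: the helper names (build_auto_restart_annotations, build_job_body) are hypothetical, and only the fragment of the body shown in the example above is produced. The remaining fields of the request, the endpoint, and authentication are outside its scope. Values are written as strings because that is how they appear in the example.

```python
import json


def build_auto_restart_annotations(retry_num, unconditional=False):
    """Build the fault-tolerance annotations for a training job.

    retry_num must be an integer from 1 to 128, the range stated for
    fault-tolerance/job-retry-num.
    """
    is_int = isinstance(retry_num, int) and not isinstance(retry_num, bool)
    if not is_int or not 1 <= retry_num <= 128:
        raise ValueError("retry_num must be an integer from 1 to 128")
    annotations = {"fault-tolerance/job-retry-num": str(retry_num)}
    if unconditional:
        annotations["fault-tolerance/job-unconditional-retry"] = "true"
    return annotations


def build_job_body(retry_num, unconditional=False):
    """Return only the part of the request body related to auto restart."""
    return {
        "kind": "job",
        "metadata": {
            "annotations": build_auto_restart_annotations(retry_num, unconditional)
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_job_body(8, unconditional=True), indent=4))
```

Calling build_job_body(8, unconditional=True) yields the same structure as the JSON example above. Leaving unconditional at its default omits the second annotation, which in this sketch is assumed to correspond to auto restart without the unconditional option.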