Unconditional Auto Restart
Context
Unexpected failures during training, and delays in manually restarting the job afterward, lengthen the overall training time. Unconditional auto restart addresses this: the system automatically restarts a failed training job regardless of the cause of the failure. This improves the training success rate and job stability. To prevent invalid restarts, the system performs at most three consecutive unconditional restarts.
To avoid losing training progress and wasting compute resources, make sure your training code supports resumable training before you enable this function. Otherwise, each restart begins training from scratch. For details, see Resumable Training and Incremental Training.
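The following is a minimal sketch of what "supports resumable training" can look like, not the platform's own mechanism. It assumes a single JSON checkpoint file; the function names (load_checkpoint, save_checkpoint, train), the fail_at parameter, and the placeholder loss value are all hypothetical and exist only for illustration. A real job would save model and optimizer state with its framework's own checkpoint API, to a storage path that survives the restart.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def save_checkpoint(path, state):
    """Write to a temporary file and rename it, so a crash mid-write
    cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_epochs, fail_at=None):
    """Run (or resume) training up to total_epochs.

    fail_at is a hypothetical hook that simulates a failure at a given epoch.
    """
    state = load_checkpoint(path)
    # Resume from the first epoch that has not been completed yet.
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated training failure")
        # Placeholder for one real epoch of training.
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        save_checkpoint(path, state)
    return state
```

With this structure, a job that fails partway through and is restarted automatically continues from the last completed epoch instead of epoch 0.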
If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Viewing Fault Recovery Details.
Enabling Unconditional Auto Restart
You can enable unconditional auto restart either on the console or through an API.
- Using the console
On the Create Training Job page, enable Auto Restart and select Unconditional auto restart. With Unconditional auto restart selected, the training job is restarted whenever the system detects a training exception, whatever the cause. If you enable Auto Restart but do not select Unconditional auto restart, the training job is restarted automatically only when the failure is caused by an environment issue. For any other problem, the job status changes to Failed.
Figure 1 Enabling unconditional auto restart
- Using an API
When creating a training job through an API, add the fault-tolerance/job-retry-num and fault-tolerance/job-unconditional-retry fields to annotations in the metadata field. To enable auto restart, set fault-tolerance/job-retry-num to a value from 1 to 128. To enable unconditional auto restart, also set fault-tolerance/job-unconditional-retry to true. Both values are passed as strings, as in the following example:
{
    "kind": "job",
    "metadata": {
        "annotations": {
            "fault-tolerance/job-retry-num": "8",
            "fault-tolerance/job-unconditional-retry": "true"
        }
    }
}
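As a hedged illustration of the API option, the sketch below builds the annotations portion of the request body in Python and checks the documented 1 to 128 range before the request is sent. The helper name build_fault_tolerance_annotations is hypothetical and is not part of any SDK. Only the two annotation keys, the "kind" value, and the string-typed values come from the example above; the rest of the request body, the endpoint, and authentication are omitted.

```python
import json


def build_fault_tolerance_annotations(retry_num, unconditional=False):
    """Return the annotations that enable auto restart (hypothetical helper).

    retry_num must be an integer from 1 to 128, as documented above.
    The unconditional-retry key is added only when unconditional is True.
    """
    is_int = isinstance(retry_num, int) and not isinstance(retry_num, bool)
    if not is_int or not 1 <= retry_num <= 128:
        raise ValueError("fault-tolerance/job-retry-num must be an integer from 1 to 128")

    # Annotation values are strings, matching the documented example.
    annotations = {"fault-tolerance/job-retry-num": str(retry_num)}
    if unconditional:
        annotations["fault-tolerance/job-unconditional-retry"] = "true"
    return annotations


# Partial request body: only the fields shown in the documented example.
request_body = {
    "kind": "job",
    "metadata": {
        "annotations": build_fault_tolerance_annotations(8, unconditional=True)
    },
}

print(json.dumps(request_body, indent=4))
```

Validating the range on the client side surfaces an out-of-range value before the job creation request is submitted.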