Training Job Restart Upon Suspension
If a training job that has been running stably is suspended without a hardware fault, restarting it can resolve the issue. However, the container of a suspended job is not stopped automatically, so standard auto restart cannot take effect, and you must configure the job to restart upon suspension. With this setting, when ModelArts detects a suspension, it forcibly stops the user process in the container and then runs the training job boot command again. No resource scheduling is involved in this restart; the job restarts in the original container.
For details about the suspension detection rules, see Detecting Training Job Suspension.
To avoid losing training progress, ensure your code can resume training from the point of interruption, and then enable unconditional auto restart to optimize compute usage. For details, see Resumable Training.
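Because a restart upon suspension reruns the boot command in the same container, the training script has to find its own saved progress at startup. The following is a minimal sketch of that pattern in plain Python, not the ModelArts resumable-training implementation. The function names, the JSON checkpoint format, and the save_every parameter are all hypothetical and chosen only for illustration; a real job would save model and optimizer state with its framework's own checkpoint utilities.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"step": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint.

    os.replace is atomic on the same filesystem, so a process that is
    killed mid-write never leaves a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_steps, save_every=10):
    """Run (or resume) a training loop up to total_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # ... one real training step would run here ...
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the process is stopped at step 20 and the boot command runs train again with the same checkpoint path, the loop continues from step 20 instead of step 0.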
If auto restart is triggered during training, the system records the restart information. You can check the fault recovery details on the training job details page. For details, see Training Job Fault Tolerance Check.
Notes and Constraints
To prevent compute waste from invalid restarts, training jobs can restart upon suspension a maximum of three consecutive times.
Enabling Restart Upon Suspension
To enable restart upon suspension, use either the console or API.
- Using the console
On the Create Training Job page, enable auto restart and restart upon suspension. Once restart upon suspension is enabled, the system restarts the training job in its original container whenever a suspension is detected. This process does not involve resource scheduling, so the value of Restarts is not affected.
Figure 1 Enabling restart upon suspension
- Using an API
When creating a training job through an API, add the fault-tolerance/job-retry-num and fault-tolerance/hang-retry fields to annotations in the metadata field. To enable auto restart and set the number of restarts, set fault-tolerance/job-retry-num to an integer from 1 to 128. To enable restart upon suspension, set fault-tolerance/hang-retry to true.
Table 1 Parameters

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| kind | Yes | String | Description: Type of a training job. Constraints: N/A. Default value: job |
| annotations | No | Map<String,String> | Description: Advanced functions of a training job. Constraints: The options are as follows: |
{
  "kind": "job",
  "metadata": {
    "annotations": {
      "fault-tolerance/job-retry-num": "3",
      "fault-tolerance/hang-retry": "true"
    }
  }
}
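The request fragment above can also be generated programmatically. The sketch below builds only the kind and metadata.annotations portion described on this page and checks the 1 to 128 range before sending anything. The helper name build_fault_tolerance_body is hypothetical, and the remaining fields of a real create-job request (algorithm, resource specification, endpoint, authentication) are deliberately omitted rather than guessed. Annotation values are emitted as strings because the field type is Map<String,String>.

```python
import json


def build_fault_tolerance_body(job_retry_num, hang_retry=True):
    """Build the fault-tolerance fragment of a training job request body.

    Hypothetical helper: it covers only the fields documented here.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(job_retry_num, bool) or not isinstance(job_retry_num, int):
        raise ValueError("job_retry_num must be an integer")
    if not 1 <= job_retry_num <= 128:
        raise ValueError("job_retry_num must be between 1 and 128")

    # Annotation values are strings (Map<String,String>).
    annotations = {"fault-tolerance/job-retry-num": str(job_retry_num)}
    if hang_retry:
        annotations["fault-tolerance/hang-retry"] = "true"

    return {"kind": "job", "metadata": {"annotations": annotations}}


if __name__ == "__main__":
    print(json.dumps(build_fault_tolerance_body(3), indent=2))
```

Validating the range on the client side is a design choice that surfaces a bad value before the request is submitted; the service remains the authority on what it accepts.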