Keeping a Training Job Running
You can only log in to Cloud Shell when the training job is in Running state. This section describes how to log in to a running training container through Cloud Shell.
Using the sleep Command
- For training jobs using a preset image
When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Preset image, add sleep.py to the code directory, and use the script as the boot file. The training job keeps running for 60 minutes. You can access the container through Cloud Shell for debugging.
Example of sleep.py
import os os.system('sleep 60m')
Figure 1 Using a preset image
- For training jobs using a custom image
When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Custom image, and enter sleep 60m in Boot Command. The training job keeps running for 60 minutes. You can access the container through Cloud Shell for debugging.
Figure 2 Using a custom image
Keeping a Failed Job Running
cmd || sleep 5h
If the training fails, the sleep command is executed. In this case, you can log in to the container image through Cloud Shell for debugging.
To debug a multi-node training job in Cloud Shell, you need to switch between worker-0 and worker-1 in Cloud Shell and run the boot command on each node. Otherwise, the task will wait for other nodes to join.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot