
Keeping a Training Job Running

You can log in to Cloud Shell only when the training job is in the Running state. This section describes how to log in to a running training container through Cloud Shell.

Using the sleep Command

  • For training jobs using a preset image

    When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Preset image, add sleep.py to the code directory, and use the script as the boot file. The training job then keeps running for 60 minutes, during which you can access the container through Cloud Shell for debugging. A more flexible variant of the script is sketched after this list.

    Example of sleep.py

    # Keep the container running for 60 minutes so that Cloud Shell can attach.
    import os
    os.system('sleep 60m')
    Figure 1 Using a preset image
  • For training jobs using a custom image

    When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Custom image, and enter sleep 60m in Boot Command. The training job then keeps running for 60 minutes, during which you can access the container through Cloud Shell for debugging.

    Figure 2 Using a custom image
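
If the container needs to stay up for a different length of time without editing the script each time, the sleep duration can be read from an environment variable. The following is a minimal sketch under that assumption: DEBUG_SLEEP_MINUTES is a user-defined environment variable chosen here for illustration (set it in the training job's environment variables); it is not a variable predefined by ModelArts.

# Variant of sleep.py: keep the container alive for a configurable number of
# minutes so that Cloud Shell can attach. DEBUG_SLEEP_MINUTES is an assumed,
# user-defined environment variable; it defaults to 60 minutes if not set.
import os
import time

minutes = int(os.environ.get('DEBUG_SLEEP_MINUTES', '60'))
time.sleep(minutes * 60)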

Keeping a Failed Job Running

When creating a training job, append || sleep 5h to the end of the boot command and start the training job. The boot command takes the following form:
cmd || sleep 5h

If the training fails, the sleep command is executed, and you can log in to the container through Cloud Shell for debugging.
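
For a training job using a preset image, where you specify a Python boot file rather than a boot command, the same fallback can be written into the boot script itself. The sketch below is illustrative only: train.py is a placeholder for your actual training script, and the 5-hour sleep runs only if that script exits with a non-zero code.

# Hypothetical boot file that mimics "cmd || sleep 5h" for a preset-image job.
# train.py is a placeholder for the actual training script in the code directory.
import subprocess
import time

result = subprocess.run(['python', 'train.py'])
if result.returncode != 0:
    # Training failed: keep the container running for 5 hours so that
    # Cloud Shell can be used for debugging.
    time.sleep(5 * 60 * 60)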

To debug a multi-node training job, switch between worker-0 and worker-1 in Cloud Shell and run the boot command on each node. Otherwise, the job will wait for the other nodes to join.