
Keeping a Training Job Running

You can log in to Cloud Shell only when the training job is in the Running state. This section describes how to log in to a running training container through Cloud Shell.

Using the sleep Command

  • For training jobs using a preset image

    When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Preset image, add sleep.py to the code directory, and use the script as the boot file. The training job then keeps running for 60 minutes, during which you can access the container through Cloud Shell for debugging. A more flexible variant of the script is sketched after this list.

    Example of sleep.py

    # Keep the container running for 60 minutes so that Cloud Shell can attach.
    import os
    os.system('sleep 60m')
    Figure 1 Using a preset image
  • For training jobs using a custom image

    When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Custom image, and enter sleep 60m in Boot Command. The training job then keeps running for 60 minutes, during which you can access the container through Cloud Shell for debugging.

    Figure 2 Using a custom image
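
If the container needs to stay up for a different length of time without editing the script each time, the sleep duration can be read from an environment variable. The following is a minimal sketch under that assumption: DEBUG_SLEEP_MINUTES is a user-defined environment variable chosen here for illustration (set it in the training job's environment variables); it is not a variable predefined by ModelArts.

# Variant of sleep.py: keep the container alive for a configurable number of
# minutes so that Cloud Shell can attach. DEBUG_SLEEP_MINUTES is an assumed,
# user-defined environment variable; it defaults to 60 minutes if not set.
import os
import time

minutes = int(os.environ.get('DEBUG_SLEEP_MINUTES', '60'))
time.sleep(minutes * 60)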

Keeping a Failed Job Running

When creating a training job, append || sleep 5h to the end of the boot command and start the training job. The boot command takes the following form:
cmd || sleep 5h

If the training fails, the sleep command is executed, and you can log in to the container through Cloud Shell for debugging.
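
For a training job using a preset image, where you specify a Python boot file rather than a boot command, the same fallback can be written into the boot script itself. The sketch below is illustrative only: train.py is a placeholder for your actual training script, and the 5-hour sleep runs only if that script exits with a non-zero code.

# Hypothetical boot file that mimics "cmd || sleep 5h" for a preset-image job.
# train.py is a placeholder for the actual training script in the code directory.
import subprocess
import time

result = subprocess.run(['python', 'train.py'])
if result.returncode != 0:
    # Training failed: keep the container running for 5 hours so that
    # Cloud Shell can be used for debugging.
    time.sleep(5 * 60 * 60)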

To debug a multi-node training job, switch between worker-0 and worker-1 in Cloud Shell and run the boot command on each node. Otherwise, the job will wait for the other nodes to join.