Using Cloud Shell to Debug a Production Training Job
ModelArts Standard provides Cloud Shell, which allows you to log in to a running container to debug training jobs in the production environment.
Constraints
Only dedicated resource pools allow logging in to training containers using Cloud Shell. The training job must be running.
Preparation: Assigning the Cloud Shell Permission to an IAM User
- Log in to the Huawei Cloud management console as a tenant user, hover the cursor over your username in the upper right corner, and choose Identity and Access Management from the drop-down list to switch to the IAM management console.
- On the IAM console, choose Permissions > Policies/Roles from the navigation pane, click Create Custom Policy in the upper right corner, and configure the following parameters.
- Policy Name: Enter a custom policy name, for example, Using Cloud Shell to access a running job.
- Policy View: Select Visual editor.
- Policy Content: Select Allow, ModelArts Service, modelarts:trainJob:exec, and default resources.
- In the navigation pane, choose User Groups. Then, click Authorize in the Operation column of the target user group. On the Authorize User Group page, select the custom policy created in the previous step and click Next. Then, select the scope and click OK.
After the configuration, all users in the user group have the permission to use Cloud Shell to log in to a running training container.
If no user group is available, create one, add the target users to it, and configure authorization. If the target user is not yet in any user group, add the user to one through the user group management function.
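If you use the JSON view instead of the visual editor, the policy content should look similar to the following sketch (a minimal example based on the standard IAM policy syntax; verify it against what the visual editor generates):
{
    "Version": "1.1",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "modelarts:trainJob:exec"
            ]
        }
    ]
}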
Using Cloud Shell
- Configure parameters based on Preparation: Assigning the Cloud Shell Permission to an IAM User.
- Log in to the ModelArts console. In the navigation pane, choose Model Training > Training Jobs.
- In the training job list, click the name of the target job to go to the training job details page.
- On the training job details page, click the Cloud Shell tab and log in to the training container.
Verify that the login is successful, as shown in the following figure.
Figure 1 Cloud Shell page
If the job is not running or the permission is insufficient, Cloud Shell cannot be used. In this case, locate the fault as prompted.
Figure 2 Error message
If you encounter a path display issue when logging in to Cloud Shell, press Enter to resolve the problem.
Figure 3 Path display issue
Keeping a Training Job Running
You can log in to Cloud Shell only when the training job is in the Running state. This section describes how to keep a training container running so that you can log in to it through Cloud Shell.
Using the sleep Command
- For training jobs using a preset image
When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Preset image, add sleep.py to the code directory, and use the script as the boot file. The training job keeps running for 60 minutes. You can access the container through Cloud Shell for debugging.
Example of sleep.py
import os
os.system('sleep 60m')
Figure 4 Using a preset image
- For training jobs using a custom image
When creating a training job, set Algorithm Type to Custom algorithm and Boot Mode to Custom image, and enter sleep 60m in Boot Command. The training job keeps running for 60 minutes. You can access the container through Cloud Shell for debugging.
Figure 5 Using a custom image
Keeping a Failed Job Running
Append || sleep 5h to the end of the original boot command (cmd in the following example):
cmd || sleep 5h
If the training fails, the sleep command is executed, and you can log in to the container through Cloud Shell for debugging.
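For example, a minimal sketch, assuming the job is started with python train.py (train.py is a placeholder for your actual boot script):
# Hypothetical example: if train.py exits with a non-zero code, keep the
# container alive for 5 hours so it can be debugged through Cloud Shell.
python train.py || sleep 5h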
To debug a multi-node training job in Cloud Shell, you need to switch between worker-0 and worker-1 in Cloud Shell and run the boot command on each node. Otherwise, the task will wait for other nodes to join.
Preventing Cloud Shell Session from Disconnection
To run a job for a long time, you can use the screen command to run the job in a remote terminal that stays active even if you disconnect. This prevents the job from failing due to disconnection.
- If screen is not installed in the image, run apt-get install screen to install it.
- Create a screen terminal.
# Use -S to create a screen terminal named name.
screen -S name
- View the created screen terminals.
screen -ls
There are screens on:
        2433.pts-3.linux        (2013-10-20 16:48:59)   (Detached)
        2428.pts-3.linux        (2013-10-20 16:48:05)   (Detached)
        2284.pts-3.linux        (2013-10-20 16:14:55)   (Detached)
        2276.pts-3.linux        (2013-10-20 16:13:18)   (Detached)
4 Sockets in /var/run/screen/S-root.
- Connect to the screen terminal whose screen_id is 2276.
screen -r 2276
- Press Ctrl+A, then D to detach from the screen terminal. After you detach, the screen session remains active and can be reconnected at any time.
For details about how to use screen, see the Screen User's Manual.
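The following is a minimal end-to-end sketch, assuming the training is started by a script named train.sh (a placeholder for your actual command):
# Create a screen terminal named train and start the job inside it.
screen -S train
bash train.sh
# Press Ctrl+A, then D to detach; the job keeps running in the session.
# Reattach to the session later by name.
screen -r train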
Analyzing the Call Stack of a Suspended Process Using the py-spy Tool
Use py-spy to analyze the call stack of a suspended process and identify the issue.
- On the ModelArts Standard console, choose Model Training > Training Jobs.
- Click the target training job to go to its details page. On the page that appears, click the Cloud Shell tab and log in to the training container (the training job must be in the Running state).
- Install the py-spy tool.
# Use the utils.sh script to automatically configure the Python environment.
source /home/ma-user/modelarts/run/utils.sh
# Install py-spy.
pip install py-spy
# If the message "connection broken by 'ProxyError('Cannot connect to proxy.')'" is displayed, disable the proxy.
# Replace repo.myhuaweicloud.com with the pip source address of the corresponding region.
export no_proxy=$no_proxy,repo.myhuaweicloud.com
pip install py-spy
- View the stack. For details about how to use the py-spy tool, see the official py-spy documentation.
# Find the PID of the training process.
ps -ef
# Check the process stack of process 12345.
# For a training job using eight cards, check the stacks of the eight processes
# started by the main process in sequence.
py-spy dump --pid 12345
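To dump the stacks of all training processes in one pass, you can use a loop similar to the following sketch (assuming the processes can be matched by the script name train.py and that pgrep is available in the image):
# Hypothetical example: dump the call stack of every process whose command line contains train.py.
for pid in $(pgrep -f train.py); do
    echo "===== PID $pid ====="
    py-spy dump --pid "$pid"
done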