Updated on 2024-05-07 GMT+08:00

Analyzing the Call Stack of the Suspended Process Using the py-spy Tool and Locating the Suspended Problem By Analyzing Code

Scenarios

If a training process is suspended (it stops making progress but does not exit), use the py-spy tool to dump the call stack of the process, then locate the cause by analyzing the code that the stack points to.

Procedure

  1. On the ModelArts console, choose Training Management > Training Jobs.
  2. Click the target training job to go to its details page. On the page that appears, click the Cloud Shell tab and log in to the training container (the training job must be in the Running state).
  3. Install the py-spy tool.

    # Use the utils.sh script to automatically configure the Python environment.
    source /home/ma-user/modelarts/run/utils.sh
    
    # Install py-spy.
    pip install py-spy
    
    # If the message "connection broken by 'ProxyError('Cannot connect to proxy.')" is displayed, exclude the pip source from the proxy and install again.
    # Replace repo.myhuaweicloud.com with the pip source address of the corresponding site.
    export no_proxy=$no_proxy,repo.myhuaweicloud.com
    pip install py-spy

  4. Check the stacks. For details about how to use the py-spy tool, see the official py-spy documentation.

    # Find the PID of the training process.
    ps -ef
    
    # Check the stack of process 12345 (replace 12345 with the PID found above).
    # For a training job using eight cards, run the command once for each of the eight processes started by the main process, replacing the PID each time.
    py-spy dump --pid 12345
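
For a multi-card job, checking each process one by one can be scripted. The following is a minimal sketch, not part of ModelArts or py-spy: the function name dump_child_stacks and the PY_SPY_CMD override variable are hypothetical, and the sketch assumes that pgrep is available in the training container and that the worker processes are direct children of the main process.

```shell
# Hypothetical helper: dump the stack of every direct child of a parent PID.
# Assumes pgrep (procps) is installed in the container.
dump_child_stacks() {
    parent_pid="$1"
    # pgrep -P lists the PIDs whose parent is $parent_pid, one per line.
    for pid in $(pgrep -P "$parent_pid"); do
        echo "===== PID $pid ====="
        # PY_SPY_CMD is an assumed override hook; it defaults to py-spy.
        ${PY_SPY_CMD:-py-spy} dump --pid "$pid"
    done
}
```

For example, if ps -ef shows that the main process has PID 12345, running dump_child_stacks 12345 prints one stack dump per child process. Comparing the dumps side by side usually shows which process is blocked at a different call than the others.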