
Obtaining Model Inference Profiling Data

PyTorch Profiler is a performance analysis tool provided by PyTorch, used to deeply analyze performance bottlenecks during the model training/inference process, helping developers optimize computational efficiency, memory usage, and hardware utilization.

Ascend PyTorch Profiler keeps its usage fully aligned with the PyTorch-GPU scenario and supports collecting PyTorch-layer operator information, CANN-layer operator information, underlying NPU operator information, and operator memory usage information, enabling comprehensive analysis of the performance of PyTorch AI tasks.

However, PyTorch Profiler can produce large data volumes, lengthen data collection, and introduce performance overhead that may distort the results. To address these issues, a lightweight performance analysis tool called Service Profiler has been introduced to analyze performance issues at the service request level. Service Profiler collects the profiling data of interest by pre-instrumenting key points within the service framework. It currently supports observing the batch size and sequence length of internal service requests and the execution time of a single batch iteration.

Constraints

Before using Service Profiler, ensure that the inference service can be started and can handle requests normally. Service Profiler is included in the versioned image as a Python library.

Checking if the Service Profiler Tool is Installed

In ModelArts 6.5.906 and later, the acs_service_profiler-1.0.1-py3-none-any.whl package is installed by default, so there is no need for a separate installation. The package is located in the llm_tools directory within the AscendCloud-LLM-xxx.zip software package.

Check if the acs-service-profiler tool is already installed:

$ pip show acs-service-profiler

If it is not installed, install the wheel package by following the procedure in Installing the acs-bench Tool. The installation command is as follows:

$ pip install llm_tools/acs_service_profiler-*-py3-none-any.whl

Note: Both Ascend PyTorch Profiler and Service Profiler are intended for the performance tuning phase of development and are not recommended in production. Typically, Ascend PyTorch Profiler is used to collect a small amount of request data (one or two requests) for analysis, while Service Profiler collects data over a period of requests (hundreds or thousands). The following section explains how to collect data with Ascend PyTorch Profiler and Service Profiler in a real-time service scenario.

Real-Time Service Profiling Through start_profile and stop_profile

  1. Before starting the inference service, set the environment variables:
    export VLLM_TORCH_PROFILER_DIR=/home/ma-user/profiler_dir # Enable Ascend PyTorch Profiler
    # export VLLM_SERVICE_PROFILER_DIR=/home/ma-user/profiler_dir # Enable Service Profiler

    VLLM_TORCH_PROFILER_DIR enables Ascend PyTorch Profiler, and VLLM_SERVICE_PROFILER_DIR enables Service Profiler. The collected profiling data is stored in the path specified by the environment variable. The two profilers cannot be enabled simultaneously.

  2. After setting the environment variables, start the inference service.

    For details about how to start the inference service, see Starting an LLM-powered Inference Service.
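
    As a minimal, illustrative sketch (the launcher, model path, and port are assumptions; use the exact command from Starting an LLM-powered Inference Service for your deployment):

    # Assumed example only: create the profiler output directory and start a
    # vLLM OpenAI-compatible server in the same shell where the variable is set.
    mkdir -p /home/ma-user/profiler_dir
    export VLLM_TORCH_PROFILER_DIR=/home/ma-user/profiler_dir
    python -m vllm.entrypoints.openai.api_server --model /path/to/model --port 8080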

  3. Send a start_profile POST request.
    curl -X POST http://${IP}:${PORT}/start_profile

    Parameters

    1. IP: The IP address where the service is deployed.
    2. PORT: The port where the service is deployed.
  4. Send an actual request.

    For sending actual requests, see LLM Inference Performance Test.
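
    As an illustrative example, a single request against the OpenAI-compatible API exposed by vLLM might look as follows (the /v1/chat/completions path, model name, and payload are assumptions; adjust them to your deployment):

    # Assumed example: send one chat completion request while profiling is active.
    curl -X POST http://${IP}:${PORT}/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'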

  5. Send a stop_profile POST request.
    curl -X POST http://${IP}:${PORT}/stop_profile

    The parameters are the same as those of the start_profile POST request.

  6. Perform post-processing and visualization.

    For visualizing data collected by Ascend PyTorch Profiler, it is recommended to use the MindStudio Insight tool. The visualization effect is shown in the following figure.

    For more information on MindStudio Insight, see the MindStudio Insight tool documentation.

    To visualize data collected by Service Profiler, use the acsprof tool for post-processing and then visualize the data in a web page that supports the Google tracing format. The specific steps are as follows:

    Post-Processing to Generate Visualization Files
    acsprof export -i ${input_path}

    The following table describes the parameters.

    Parameter | Type | Description | Mandatory
    --- | --- | --- | ---
    -i / --input_path | String | Path to the Service Profiler collection folder. Both the parent folder and its subfolders are supported. | Yes
    -o / --output_path | String | Output path for the post-processed files. Defaults to the input folder path. | No
    -f / --force_reparse | Bool | Whether to force re-parsing of folders that have already been parsed. The default value is False (no forced re-parsing). When multiple batches of data are collected, only the first batch is parsed automatically; set this option to True to re-parse subsequent batches. | No

    Example:

    acsprof export -i /home/ma-user/profiler_dir

    If the export succeeds, normal log output is displayed.

    Post-processing re-parses the collected profiler data and exports metrics such as TTFT, TPOT, and framework throughput. It also generates a visualization file named trace_view.json; when multiple instances are profiled, the tool merges their timeline data into an overview_trace_view.json file. You can load these files into chrome://tracing/ or https://ui.perfetto.dev/ for visual analysis.
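
    As an illustrative check (paths follow the example above), list the output directory before loading the files into a tracing UI:

    # trace_view.json is generated per instance; overview_trace_view.json appears
    # only when timeline data from multiple instances is combined.
    ls /home/ma-user/profiler_dir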

    Result visualization
    1. Request arrival and end times

    2. Service process group batch details

      The details include the execution time of a single group batch iteration, the request IDs, batch size, sequence length, and whether each request is in the prompt or decode phase.