LLM Inference Performance Test
The acs-bench tool is required for this performance test. The acs-bench prof command runs an LLM performance benchmark; you can set the data length and request count to evaluate Ascend-vLLM service performance under various request loads. Both ramp-up tests and performance stress tests are supported.
Installing the acs-bench Tool
ModelArts 6.5.906 and later versions include the acs_bench-1.0.1-py3-none-any.whl package for the acs-bench tool pre-installed. No separate installation is required.
Check if the acs-bench tool is already installed:
$ pip show acs-bench
To install the acs-bench tool, follow these steps:
- Obtain the acs-bench tool's whl package. It is located in the llm_tools directory of the AscendCloud-LLM-xxx.zip software package. Install acs-bench into a Python runtime environment that can reach the inference service to be tested; performing this operation in the container where the inference service runs is recommended.
- (Optional) Configure the pip source according to your actual needs.
$ mkdir -p ~/.pip
$ vim ~/.pip/pip.conf
# Add the following content to the configuration file. The example below uses the Huawei source:
[global]
index-url = https://mirrors.tools.huawei.com/pypi/simple
trusted-host = mirrors.tools.huawei.com
timeout = 120
- (Optional) Install the acs-bench tool (not needed if the package is pre-installed):
$ pip install llm_tools/acs_bench-*-py3-none-any.whl
Preparations: Configuring providers.yaml
The acs-bench tool accesses the server through the providers.yaml configuration file, which contains information such as the server's id, name, api_key, base_url, model_name, and model_category.
Before using the acs-bench tool, create a providers.yaml file locally, fill in the parameter values according to your actual situation, and save it. The following is an example:
providers:
  - id: 'ascend-vllm'
    name: 'ascend-vllm'
    api_key: 'EMPTY'
    base_url: 'http://server_ip:port/v1'
    model_name: 'Qwen3-32b'
    model_category: 'Qwen3-32b'
Table 1 describes the parameters.

| Field | Mandatory | Description |
|---|---|---|
| id | No | Identifier for the service provider. |
| name | No | Name of the service provider. |
| api_key | No | Originally the api_key for OpenAI; it can now be used as the MaaS authentication code. |
| base_url | Yes | Base URL of the server, in the form http://{IP_address}:{port}/v1. |
| model_name | Yes | Model name used when starting the inference service. If the served-model-name parameter was set when starting the inference service, use its value; otherwise, use the model path used when starting the service. |
| model_category | No | Category of the model; can be omitted. |
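Before running a benchmark, you can sanity-check base_url and model_name. Assuming the service exposes the standard OpenAI-compatible /v1/models endpoint (as Ascend-vLLM does for /v1 services), a quick check might look like this; replace server_ip and port with your values:
$ curl http://server_ip:port/v1/models
# The returned model ids should match the model_name configured in providers.yaml.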
Obtaining Datasets
The acs-bench tool requires datasets for testing and currently supports open-source datasets in the LongBench and ShareGPT formats. If these datasets are not available locally, you can use the acs-bench generate dataset command to generate custom datasets. The following examples show how to use this command. For details about the parameters, see Dataset Generation Parameter Description.
- Generate a random dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type random \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Generate an embedding dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --task embedding \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Generate a reranking dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --task rerank \
  --document-size 4 \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Filter datasets from LongBench.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type LongBench \
  --input-path ./dataset/long_bench \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Filter datasets from ShareGPT.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type ShareGPT \
  --input-path ./dataset/ShareGPT \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- LongBench download link: https://huggingface.co/datasets/zai-org/LongBench/tree/main
- ShareGPT download link: https://huggingface.co/datasets/shibing624/sharegpt_gpt4
The --input-length and --num-requests parameters of the dataset generation command accept only single values. To generate datasets with different specifications, change --input-length or --num-requests to the desired values and run the command again, as in the sketch below.
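Since each invocation accepts a single value, a small shell loop is one way to produce several specifications in one go (illustrative only; the lengths and paths are placeholders):
# Generate random datasets for several input lengths.
$ for LEN in 128 2048; do
    acs-bench generate dataset \
      --tokenizer ./tokenizer/Qwen3-32b \
      --dataset-type random \
      --output-path ./built_in_dataset \
      --input-length ${LEN} \
      --num-requests 100
  done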
Dataset Generation Parameter Description
The command to query the dataset generation parameters is as follows:
$ acs-bench generate dataset -h
| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -dt/--dataset-type | String | No | Source of the dataset to be generated, that is, the type of open-source dataset used for data filtering. The default value is random, which generates random token combinations. |
| -i/--input-path | String | No | Path to the open-source dataset used for data filtering. Not required when --dataset-type is random. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -tk/--task | String | No | Task backend for the dataset. The default value is generate. Options: generate, rerank, and embedding. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs; multiple pairs can be provided. Allowed keys: image_height, image_width, duration, and fps. |
| -o/--output-path | String | Yes | Output path for the generated JSON file containing prompts. |
| -il/--input-length | Int | Yes | Length of each prompt in the custom dataset. |
| -pl/--prefix-length | Int | No | Length of the common prefix prompt in the custom dataset, effective only in random mode. The default is 0. |
| -n/--num-requests | Int | Yes | Number of prompts to generate. |
| -ds/--document-size | Int | No | Number of documents per query. The default is 4. |
| -t/--tokenizer | String | Yes | Path to the tokenizer model folder; both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default is master. |
| -ra/--range-ratio-above | Float | No | Ratio by which the prompt length can dynamically increase. The maximum length is input_length x (1 + range_ratio_above). The value range is [0, 1]. The default is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the prompt length can dynamically decrease. The minimum length is input_length x (1 - range_ratio_below). The value range is [0, 1]. The default is 0.0. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default is False. |
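For instance, the range-ratio flags from the table let prompt lengths vary around the target: with --input-length 128 and both ratios set to 0.5, generated prompts would fall between 64 tokens (128 x 0.5) and 192 tokens (128 x 1.5). An illustrative command, reusing the random-dataset example above:
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type random \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --range-ratio-above 0.5 \
  --range-ratio-below 0.5 \
  --num-requests 100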
Performance Stress Testing Mode Verification
# Use a thread pool for concurrent testing. The default concurrency backend is threading-pool (multi-thread);
# the asynchronous coroutine mode asyncio and the multi-process mode processing-pool are also available.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Ramp-Up Mode Verification
# Example using the multi-thread concurrency mode, starting at a concurrency of 1 and increasing by 2 every 5,000 ms.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --use-climb --climb-mode linear --growth-rate 2 --init-concurrency 1 --growth-interval 5000 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
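With the settings above in linear mode, and assuming the growth rate is added to the current concurrency at each interval, the effective concurrency would climb 1, 3, 5, 7, ... every 5,000 ms until it reaches the --concurrency ceiling of the current group.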
Concurrent Testing of Embedding Models Using an Embedding Dataset
An example of using the acs-bench prof command for concurrent testing of an embedding model with an embedding dataset is shown below. For details about parameter descriptions, see Parameter Descriptions for Usage Example. For details about output artifact descriptions, see Artifact Description.
# Example using the multi-thread concurrency mode, with the backend set to embedding.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend embedding --warmup 1 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Concurrent Testing of Reranking Models Using a Rerank Dataset
An example of using the acs-bench prof command for concurrent testing of a reranking model with a rerank dataset is shown below. For details about parameter descriptions, see Parameter Descriptions for Usage Example. For details about output artifact descriptions, see Artifact Description.
# Example using the multi-thread concurrency mode, with the backend set to rerank.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend rerank --warmup 1 \
  --document-size 4,4,4,4 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Parameter Descriptions for Usage Example
1. The parameters --concurrency, --init-concurrency, --num-requests, --input-length, --output-length, and --config-option can be specified either as a comma-separated list (no spaces around the commas) or by repeating the flag once per value. If --num-requests is not specified, it defaults to the same value as --concurrency.
Example:
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --num-requests 1 --num-requests 2 --num-requests 4 --num-requests 8 \
  --concurrency 1 --concurrency 2 --concurrency 4 --concurrency 8 \
  --input-length 128 --input-length 128 --input-length 2048 --input-length 2048 \
  --output-length 128 --output-length 2048 --output-length 128 --output-length 2048 \
  --benchmark-save-path ./output_path/
2. The --input-length values used in the performance stress testing and ramp-up examples must exist in the pre-generated dataset. If they do not, see Obtaining Datasets to generate a dataset with the corresponding input length.
The performance benchmark parameters fall into four groups: dataset options, concurrency options, metrics options, and serving options. The tables below describe each group in turn.
$ acs-bench prof -h
Dataset Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -dt/--dataset-type | String | No | Type of dataset. The default value is custom, which means a user-defined dataset. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs; multiple pairs can be provided, for example "KEY1:VALUE1,KEY2:VALUE2". Allowed keys: image_height, image_width, duration, and fps. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -i/--input-path | String | Yes | Path to the dataset. |
| -il/--input-length | Int | Yes | Input length of the custom dataset, effective only when --dataset-type is custom. Multiple input lengths can be specified, separated by ",". |
| -ds/--document-size | Int | No | Number of documents per query. Multiple integer values can be specified. |
| -n/--num-requests | Int | Yes | Number of requests for concurrent testing. Multiple values can be specified, separated by ",". Defaults to the concurrency value. |
| -t/--tokenizer | String | No | Path to the tokenizer model folder; both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default is master. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default is False. |
Concurrency Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -c/--concurrency | Int | No | Maximum concurrency level. The default value is 1. Multiple concurrency levels can be specified, separated by ",". |
| -nc/--num-process | Int | No | Number of processes for parallel processing; should be less than or equal to the number of CPUs. Multiple values can be specified, separated by ",". The default value is [1]. |
| -r/--request-rate | Float | No | Request arrival rate, effective only when concurrency is 1. The default value is infinity (INF). |
| -rm/--request-mode | String | No | Request arrival mode: normal or pd-adaptive. The default value is normal. |
| -pc/--prefill-concurrency | Int | No | Maximum concurrency for all prefill operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -dc/--decoder-concurrency | Int | No | Maximum concurrency for all decode operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -burst/--burstiness | Float | No | Burst factor for requests, effective only when request_rate is not inf. The default value is 1.0. |
| -cb/--concurrency-backend | String | No | Concurrency backend. The default is threading-pool. Supported options: threading-pool (multi-thread), asyncio (asynchronous coroutine), and processing-pool (multi-process). |
| -ub/--use-climb | Bool | No | Whether to enable ramp-up mode. The default value is False (ramp-up mode disabled). |
| -gr/--growth-rate | Int | No | Concurrency growth per ramp-up step, effective only in ramp-up mode. The default value is 0. |
| -gi/--growth-interval | Float | No | Time interval between ramp-up steps, effective only in ramp-up mode. The default value is 1,000 ms. |
| -ic/--init-concurrency | Int | No | Initial concurrency level, effective only in ramp-up mode. Defaults to the concurrency value. Multiple initial concurrency levels can be specified, separated by ",". |
| -cm/--climb-mode | String | No | Ramp-up curve, effective only in ramp-up mode. The default value is linear. |
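For example, to emulate an open-loop arrival pattern rather than a fixed concurrency, the rate flags above could be combined with a concurrency of 1 (a sketch; the rate and request count are placeholders, and a burstiness of 1.0 conventionally corresponds to a Poisson arrival process):
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --backend openai-chat \
  --concurrency 1 --num-requests 100 \
  --request-rate 10 --burstiness 1.0 \
  --input-length 128 --output-length 128 \
  --benchmark-save-path ./output_path/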
Metrics Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -g/--goodput | String | No | Service SLO: performance thresholds that requests must meet to count toward goodput, in milliseconds (ms). Supported metric types: ttft, tpot, and e2el. For example, -g ttft:50 -g e2el:1000 specifies the ttft and e2el thresholds. |
| -bi/--bucket-interval | Float | No | Sampling interval for real-time performance metrics, in ms. If specified, changes in performance metrics are monitored dynamically within each bucket_interval window. |
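Combining the goodput flags with the stress-test command above might look like this (a sketch; the thresholds are placeholder values):
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --backend openai-chat \
  --concurrency 4 --num-requests 4 \
  --input-length 128 --output-length 128 \
  -g ttft:50 -g e2el:1000 \
  --benchmark-save-path ./output_path/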
Serving Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -b/--backend | String | No | Type of request service API: openai, openai-chat, embedding, or rerank. The default value is openai-chat. |
| -p/--provider | String | Yes | Path to the provider file, which you must create and specify. |
| -pid/--provider-id | String | No | Provider ID to be tested; useful when the provider file contains multiple configurations and only a specific one needs to run. |
| -ol/--output-length | Int | Yes | Length of output tokens. Multiple output lengths can be specified, separated by ",". |
| -ra/--range-ratio-above | Float | No | Ratio by which the output token length can dynamically increase. The maximum length is output_length x (1 + range_ratio_above). The value range is [0, 1]. The default is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the output token length can dynamically decrease. The minimum length is output_length x (1 - range_ratio_below). The value range is [0, 1]. The default is 0.0. |
| -w/--warmup | Int | No | Number of warmup requests. The default value is 0 (warmup disabled). |
| -e/--epochs | Int | No | Number of times to run the same concurrency configuration. The default value is 1 (each concurrency group runs once). |
| -tk/--top-k | Int | No | Top-k sampling parameter, effective only for OpenAI-compatible backends. The default value is -1. |
| -tp/--top-p | Float | No | Top-p sampling parameter, effective only for OpenAI-compatible backends. The default value is 1.0. |
| -mp/--min-p | Float | No | Min-p sampling parameter: the minimum probability for a token to be considered. Must be in the range [0, 1]; effective only for OpenAI-compatible backends. |
| -temper/--temperature | Float | No | Temperature sampling parameter. The default value is 0. |
| -cs/--chunk-size | Int | No | Chunk size in stream requests. The default value is 1024. |
| -ef/--encoding-format | String | No | Encoding format of the backend's response: float or base64. The default value is float. |
| -usd/--use-spec-decode | Bool | No | Whether speculative inference is enabled on the server. If enabled, it can be combined with --num-spec-tokens to calculate the MTP acceptance rate. The default value is False (speculative inference disabled). |
| -nst/--num-spec-tokens | Int | No | Number of speculative inference tokens configured on the server. A value of 1 means the server infers one additional token each time; can be combined with --use-spec-decode to calculate the MTP acceptance rate. The default value is -1. |
| -umar/--use-mtp-accept-rate | Bool | No | Whether to ignore the number of tokens generated by the model when calculating the MTP acceptance rate. The default value is True (generated tokens are ignored). |
| -nss/--num-scheduler-steps | Int | No | Multi-step size on the server, used to calculate the MTP acceptance rate. The default value is 1. |
| -timeout/--timeout | Float | No | Request timeout. The default value is 1,000 s. |
| -ie/--ignore-eos | Bool | No | Whether to ignore EOS. The default value is True (EOS is ignored). |
| -cus/--continuous-usage-stats | Bool | No | Whether to include usage information in each returned chunk of stream requests. The default value is True (usage information is included in each chunk). |
| -sst/--skip-special-tokens | Bool | No | Whether to skip special tokens. The default value is False. |
| -er/--enable-max-tokens-exclude-reasoning | Bool | No | Whether to exclude reasoning tokens when counting max-tokens; the client proactively disconnects from the server when max-tokens is reached. The default value is True. |
| -pf/--profile | Bool | No | Whether to collect Service Profiler information from the server. The default value is False (not collected). The warmup phase does not collect server information. |
| -pl/--profile-level | String | No | Collection level of the server's Service Profiler: Level_none, Level0, Level1, or Level2. Effective only when profiling is enabled. The default value is Level_none. |
| -trace/--trace | Bool | No | Whether to enable the tool's trace switch, which monitors and displays the concurrency process. The default value is False (disabled). |
| -s/--benchmark-save-path | String | No | Path to the folder where performance metrics are saved. The default value is ./benchmark_output. |
Artifact Description
After the script runs, a requests directory and a summary CSV file are created in the output path specified by the --benchmark-save-path parameter. The requests directory contains per-request CSV files whose names start with requests. The file names follow these patterns:
1. requests_{provider}_{dataset_type}_{control_method}_concurrency{concurrency}_{concurrency_backend}_input{input_length}_output{output_length}_{current_time}.csv
2. summary_{provider}_{control_method}_{concurrency_backend}_{current_time}.csv
A hypothetical illustration of the resulting layout, with placeholder values substituted into the name fields, is shown below.
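output_path/
├── requests/
│   └── requests_ascend-vllm_custom_normal_concurrency4_threading-pool_input128_output128_20250101120000.csv
└── summary_ascend-vllm_normal_threading-pool_20250101120000.csv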