
LLM Inference Performance Test

Performance testing requires the acs-bench tool. The acs-bench prof command runs an LLM performance benchmark: you set the request data length and request count to evaluate Ascend-vLLM service performance under various loads. Both ramp-up tests and performance stress tests are supported.

Installing the acs-bench Tool

ModelArts 6.5.906 and later versions ship with the acs_bench-1.0.1-py3-none-any.whl package pre-installed, so no separate installation is required.

Check if the acs-bench tool is already installed:

$ pip show acs-bench
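
If the tool is installed, pip prints its package metadata, similar to the following (the version and any other fields shown are illustrative):

Name: acs-bench
Version: 1.0.1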

If the tool is not installed, follow these steps to install it:

  1. Obtain the acs-bench whl package from the llm_tools directory of the AscendCloud-LLM-xxx.zip software package. Install acs-bench in a Python runtime environment that can access the inference service to be tested; performing this step in the container where the inference service is started is recommended.
  2. (Optional) Configure the pip source according to your actual needs.
    $ mkdir -p ~/.pip
    $ vim ~/.pip/pip.conf
    # Add the following content to the configuration file. The example below uses the Huawei source: 
    [global]
    index-url=https://mirrors.tools.huawei.com/pypi/simple
    trusted-host=mirrors.tools.huawei.com
    timeout = 120
  3. Install the acs-bench tool:
    $ pip install llm_tools/acs_bench-*-py3-none-any.whl

Preparations: Configuring providers.yaml

The acs-bench tool accesses the server through the providers.yaml configuration file, which contains information such as the server's id, name, api_key, base_url, model_name, and model_category.

Before using the acs-bench tool, create a providers.yaml file locally, fill in the parameter values according to your actual situation, and save it. The following is an example:

providers:
  - id: 'ascend-vllm'
    name: 'ascend-vllm'
    api_key: 'EMPTY'
    base_url: 'http://server_ip:port/v1'
    model_name: 'Qwen3-32b'
    model_category: 'Qwen3-32b'
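
The file can also hold several provider entries, for example when the same test plan targets multiple endpoints; at run time, select one with the --provider-id option described in Table 6 (for example, --provider-id ascend-vllm-a). A sketch with hypothetical IDs and addresses:

providers:
  - id: 'ascend-vllm-a'
    name: 'ascend-vllm-a'
    api_key: 'EMPTY'
    base_url: 'http://10.0.0.1:8080/v1'
    model_name: 'Qwen3-32b'
  - id: 'ascend-vllm-b'
    name: 'ascend-vllm-b'
    api_key: 'EMPTY'
    base_url: 'http://10.0.0.2:8080/v1'
    model_name: 'Qwen3-32b'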

Table 1 describes the parameters.

Table 1 Parameters in the providers.yaml file

| Field | Mandatory | Description |
| --- | --- | --- |
| id | No | Identifier of the service provider. |
| name | No | Name of the service provider. |
| api_key | No | Originally the OpenAI api_key; it can also carry the MaaS authentication code. |
| base_url | Yes | Base URL of the server, in the form http://{$IP_address}:{$port}/v1. |
| model_name | Yes | Model name used when starting the inference service. If the served-model-name parameter was set when starting the service, use that value; otherwise, use the default model path used to start the service. |
| model_category | No | Category of the model; can be omitted. |

Obtaining Datasets

The acs-bench tool requires datasets for testing and currently supports open-source datasets in the LongBench and ShareGPT formats. If no open-source dataset is available locally, you can use the acs-bench generate dataset command to generate custom datasets. The examples below show how to use this command; for parameter details, see Dataset Generation Parameter Description.

  1. Generate a random dataset.
    $ acs-bench generate dataset \
    --tokenizer ./tokenizer/Qwen3-32b \
    --dataset-type random \
    --output-path ./built_in_dataset \
    --input-length 128 \
    --num-requests 100
  2. Generate an embedding dataset.
    $ acs-bench generate dataset \
    --tokenizer ./tokenizer/Qwen3-32b \
    --task embedding \
    --output-path ./built_in_dataset \
    --input-length 128 \
    --num-requests 100
  3. Generate a reranking dataset.
    $ acs-bench generate dataset \
    --tokenizer ./tokenizer/Qwen3-32b \
    --task rerank \
    --document-size 4 \
    --output-path ./built_in_dataset \
    --input-length 128 \
    --num-requests 100
  4. Filter datasets from LongBench.
    $ acs-bench generate dataset \
    --tokenizer ./tokenizer/Qwen3-32b \
    --dataset-type LongBench \
    --input-path ./dataset/long_bench --output-path ./built_in_dataset \
    --input-length 128 \
    --num-requests 100
  5. Filter datasets from ShareGPT.
    $ acs-bench generate dataset \
    --tokenizer ./tokenizer/Qwen3-32b \
    --dataset-type ShareGPT \
    --input-path ./dataset/ShareGPT --output-path ./built_in_dataset \
    --input-length 128 \
    --num-requests 100

The --input-length and --num-requests parameters of the dataset generation command accept only a single value each.

To generate datasets with different specifications, change --input-length or --num-requests to the desired values and run the command again for each specification, for example with a shell loop like the one sketched below.
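
A minimal shell sketch for generating several specifications in one go (the lengths and paths reuse the example values above):

$ for len in 128 2048; do
    acs-bench generate dataset \
      --tokenizer ./tokenizer/Qwen3-32b \
      --dataset-type random \
      --output-path ./built_in_dataset \
      --input-length ${len} \
      --num-requests 100
  done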

Dataset Generation Parameter Description

The command to query the dataset generation parameters is as follows:

$ acs-bench generate dataset -h

Table 2 Custom dataset generation command parameters

| Parameter | Type | Mandatory | Description |
| --- | --- | --- | --- |
| -dt/--dataset-type | String | No | Source of the dataset to be generated, that is, the type of open-source dataset used for data filtering. The default value is random, which generates random token combinations. |
| -i/--input-path | String | No | Path to the open-source dataset used for data filtering. Not required when --dataset-type is random. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -tk/--task | String | No | Task backend for the dataset. The default value is generate. Options: generate, rerank, and embedding. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs. Multiple pairs can be provided. Allowed keys: image_height, image_width, duration, and fps. |
| -o/--output-path | String | Yes | Output path for the generated JSON file containing prompts. |
| -il/--input-length | Int | Yes | Length of each prompt in the custom dataset. |
| -pl/--prefix-length | Int | No | Length of the common prefix prompt in the custom dataset, effective only in random mode. The default value is 0. |
| -n/--num-requests | Int | Yes | Number of prompts to generate. |
| -ds/--document-size | Int | No | Document size for each query. The default value is 4. |
| -t/--tokenizer | String | Yes | Path to the tokenizer model folder. Both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default value is master. |
| -ra/--range-ratio-above | Float | No | Ratio by which the prompt length can dynamically increase. The maximum length is input_length x (1 + range_ratio_above). Value range: [0, 1]. The default value is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the prompt length can dynamically decrease. The minimum length is input_length x (1 - range_ratio_below). Value range: [0, 1]. The default value is 0.0. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default value is False. |
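
For multimodal datasets, --modal-type is combined with --config-option. A sketch, assuming the image-text modal type accepts the image_height and image_width keys listed above (the dimensions are illustrative, and the exact KEY:VALUE syntax should be checked with acs-bench generate dataset -h):

$ acs-bench generate dataset \
--tokenizer ./tokenizer/Qwen3-32b \
--modal-type image-text \
--config-option image_height:224 \
--config-option image_width:224 \
--output-path ./built_in_dataset \
--input-length 128 \
--num-requests 100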

Performance Stress Testing Mode Verification

An example of using the acs-bench prof command for performance stress testing is shown below. For parameter descriptions, see Parameter Descriptions for Usage Example. For output artifact descriptions, see Artifact Description.

# Concurrent test using a thread pool. The default concurrency backend is threading-pool (multi-thread); asyncio (asynchronous coroutine) and processing-pool (multi-process) are also available.
$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--concurrency-backend threading-pool \
--backend openai-chat --warmup 1 \
--epochs 2 \
--num-requests 1,2,4,8 --concurrency 1,2,4,8 \
--input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
--benchmark-save-path ./output_path/

Ramp-Up Mode Verification

An example of using the acs-bench prof command for ramp-up testing is shown below. For parameter descriptions, see Parameter Descriptions for Usage Example. For output artifact descriptions, see Artifact Description.

# Example using the multi-thread concurrency mode, starting at a concurrency of 1 and increasing by 2 every 5,000 ms.
$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--concurrency-backend threading-pool \
--backend openai-chat --warmup 1 \
--epochs 2 \
--use-climb --climb-mode linear --growth-rate 2 --init-concurrency 1 --growth-interval 5000 \
--num-requests 1,2,4,8 --concurrency 1,2,4,8 \
--input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
--benchmark-save-path ./output_path/

Concurrent Testing of Embedding Models Using an Embedding Dataset

An example of using the acs-bench prof command for concurrent testing of an embedding model with an embedding dataset is shown below. For parameter descriptions, see Parameter Descriptions for Usage Example. For output artifact descriptions, see Artifact Description.

# Example using multi-thread concurrency mode, with the backend set to embedding
$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--concurrency-backend threading-pool \
--backend embedding --warmup 1 \
--epochs 2 \
--num-requests 1,2,4,8 --concurrency 1,2,4,8 \
--input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
--benchmark-save-path ./output_path/

Concurrent Testing of Reranking Models Using a Rerank Dataset

An example of using the acs-bench prof command for concurrent testing of a reranking model with a rerank dataset is shown below. For parameter descriptions, see Parameter Descriptions for Usage Example. For output artifact descriptions, see Artifact Description.

# Example using multi-thread concurrency mode, with the backend set to rerank
$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--concurrency-backend threading-pool \
--backend rerank --warmup 1 \
--document-size 4,4,4,4 \
--epochs 2 \
--num-requests 1,2,4,8 --concurrency 1,2,4,8 \
--input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
--benchmark-save-path ./output_path/

Parameter Descriptions for Usage Example

1. The parameters --concurrency, --init-concurrency, --num-requests, --input-length, --output-length, and --config-option can be specified either as a comma-separated list (with no spaces around the commas) or by repeating the flag once per value. If --num-requests is not specified, it defaults to the value of --concurrency.

Example:

$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--concurrency-backend threading-pool \
--backend openai-chat --warmup 1 \
--epochs 2 \
--num-requests 1 --num-requests 2 --num-requests 4 --num-requests 8 \
--concurrency 1 --concurrency 2 --concurrency 4 --concurrency 8 \
--input-length 128 --input-length 128 --input-length 2048 --input-length 2048 \
--output-length 128 --output-length 2048 --output-length 128 --output-length 2048 \
--benchmark-save-path ./output_path/

2. The --input-length values used in the performance stress testing and ramp-up testing examples must exist in the pre-generated dataset. If they do not, see Obtaining Datasets to generate a dataset with the corresponding input lengths.

The performance benchmark test parameters fall into four groups: Dataset Options, Concurrency Options, Metrics Options, and Serving Options. The following tables describe each group.

Query acs-bench test parameters:
$ acs-bench prof -h

Table 3 Dataset Options

| Parameter | Type | Mandatory | Description |
| --- | --- | --- | --- |
| -dt/--dataset-type | String | No | Type of the dataset. The default value is custom, which means a user-defined dataset. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs; multiple pairs can be provided, for example "KEY1:VALUE1,KEY2:VALUE2". Allowed keys: image_height, image_width, duration, and fps. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -i/--input-path | String | Yes | Path to the dataset. |
| -il/--input-length | Int | Yes | Input length of the custom dataset, effective only when dataset-type is custom. Multiple input lengths can be specified, separated by ",". |
| -ds/--document-size | Int | No | Document size for each query. Multiple integer values can be specified. |
| -n/--num-requests | Int | Yes | Number of requests for concurrent testing. Multiple values can be specified, separated by ",". Defaults to the concurrency value. |
| -t/--tokenizer | String | No | Path to the tokenizer model folder. Both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default value is master. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default value is False. |

Table 4 Concurrency Options

| Parameter | Type | Mandatory | Description |
| --- | --- | --- | --- |
| -c/--concurrency | Int | No | Maximum concurrency level. The default value is 1. Multiple concurrency levels can be specified, separated by ",". |
| -nc/--num-process | Int | No | Number of processes for parallel processing; must be less than or equal to the number of CPUs. Multiple values can be specified, separated by ",". The default value is [1]. |
| -r/--request-rate | Float | No | Request arrival rate, effective only when concurrency is 1. The default value is infinity (INF). |
| -rm/--request-mode | String | No | Request arrival mode: normal or pd-adaptive. The default value is normal. |
| -pc/--prefill-concurrency | Int | No | Maximum concurrency for all prefill operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -dc/--decoder-concurrency | Int | No | Maximum concurrency for all decode operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -burst/--burstiness | Float | No | Burst factor for requests, effective only when request_rate is not inf. The default value is 1.0. |
| -cb/--concurrency-backend | String | No | Concurrency backend. The default value is threading-pool. Options: threading-pool (single-process multi-threaded), asyncio (asynchronous coroutine), and processing-pool (multi-process). With processing-pool, num-process must be less than or equal to min(concurrency, init_concurrency); if this condition is not met, the tool automatically sets num-process to the smaller of the two. |
| -ub/--use-climb | Bool | No | Whether to enable ramp-up mode. The default value is False (ramp-up disabled). |
| -gr/--growth-rate | Int | No | Concurrency growth per ramp-up step, effective only in ramp-up mode. The default value is 0. |
| -gi/--growth-interval | Float | No | Time interval between ramp-up steps, effective only in ramp-up mode. The default value is 1,000 ms. |
| -ic/--init-concurrency | Int | No | Initial concurrency level, effective only in ramp-up mode. Defaults to the concurrency value. Multiple values can be specified, separated by ",". |
| -cm/--climb-mode | String | No | Ramp-up mode, effective only in ramp-up mode. The default value is linear. Options: static (concurrency remains constant, equivalent to stress testing) and linear (concurrency increases linearly per interval until it reaches the maximum concurrency level). |
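
To pace requests rather than saturate the service, a single-concurrency run can combine --request-rate with --burstiness. A sketch reusing the provider and dataset paths from the earlier examples (the rate of 5 requests/s is illustrative):

$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--backend openai-chat \
--concurrency 1 --request-rate 5 --burstiness 1.0 \
--input-length 128 --output-length 128 \
--benchmark-save-path ./output_path/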

Table 5 Metrics Options

| Parameter | Type | Mandatory | Description |
| --- | --- | --- | --- |
| -g/--goodput | String | No | Service SLO: performance thresholds that requests must meet to count toward goodput, in milliseconds (ms). Supported metric types: ttft, tpot, and e2el. For example, -g ttft:50 -g e2el:1000 specifies the ttft and e2el thresholds. |
| -bi/--bucket-interval | Float | No | Sampling interval for real-time performance metrics, in ms. If specified, changes in performance metrics can be monitored dynamically every bucket_interval ms. |
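
For example, a run that counts goodput against a 50 ms TTFT and 1,000 ms end-to-end latency SLO could look like this (the thresholds are illustrative):

$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--backend openai-chat \
--num-requests 8 --concurrency 8 \
--input-length 128 --output-length 128 \
-g ttft:50 -g e2el:1000 \
--benchmark-save-path ./output_path/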

Table 6 Serving Options

| Parameter | Type | Mandatory | Description |
| --- | --- | --- | --- |
| -b/--backend | String | No | Type of request service API: openai, openai-chat, embedding, or rerank. The default value is openai-chat. |
| -p/--provider | String | Yes | Path to the provider file, which you create and specify yourself. |
| -pid/--provider-id | String | No | Provider ID to test; useful when the provider file contains multiple configurations and only a specific one needs to run. |
| -ol/--output-length | Int | Yes | Length of output tokens. Multiple output lengths can be specified, separated by ",". |
| -ra/--range-ratio-above | Float | No | Ratio by which the output token length can dynamically increase. The maximum length is output_length x (1 + range_ratio_above). Value range: [0, 1]. The default value is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the output token length can dynamically decrease. The minimum length is output_length x (1 - range_ratio_below). Value range: [0, 1]. The default value is 0.0. |
| -w/--warmup | Int | No | Number of warmup requests. The default value is 0 (warmup disabled). |
| -e/--epochs | Int | No | Number of runs per concurrency configuration. The default value is 1 (each concurrency group runs once). |
| -tk/--top-k | Int | No | Top-k sampling parameter, effective only for OpenAI-compatible backends. The default value is -1. |
| -tp/--top-p | Float | No | Top-p sampling parameter, effective only for OpenAI-compatible backends. The default value is 1.0. |
| -mp/--min-p | Float | No | Min-p sampling parameter: the minimum probability for a token to be considered. Value range: [0, 1]; effective only for OpenAI-compatible backends. |
| -temper/--temperature | Float | No | Temperature sampling parameter. The default value is 0. |
| -cs/--chunk-size | Int | No | Chunk size in stream requests. The default value is 1024. |
| -ef/--encoding-format | String | No | Encoding format of the backend response: float or base64. The default value is float. |
| -usd/--use-spec-decode | Bool | No | Whether speculative inference is enabled on the server. If enabled, combine with --num-spec-tokens to calculate the MTP acceptance rate. The default value is False (speculative inference disabled). |
| -nst/--num-spec-tokens | Int | No | Number of speculative tokens configured on the server. A value of 1 means the server infers one additional token each step; combine with --use-spec-decode to calculate the MTP acceptance rate. The default value is -1. |
| -umar/--use-mtp-accept-rate | Bool | No | Whether to ignore the number of tokens generated by the model when calculating the MTP acceptance rate. The default value is True (the number is ignored). |
| -nss/--num-scheduler-steps | Int | No | Multi-step size on the server, used to calculate the MTP acceptance rate. The default value is 1. |
| -timeout/--timeout | Float | No | Request timeout. The default value is 1,000 s. |
| -ie/--ignore-eos | Bool | No | Whether to ignore EOS. The default value is True (EOS is ignored). |
| -cus/--continuous-usage-stats | Bool | No | Whether each returned chunk in stream requests includes usage information. The default value is True. |
| -sst/--skip-special-tokens | Bool | No | Whether to skip special tokens. The default value is False. |
| -er/--enable-max-tokens-exclude-reasoning | Bool | No | Whether max-tokens excludes reasoning; when enabled, the tool proactively disconnects from the server once max-tokens is reached. The default value is True. |
| -pf/--profile | Bool | No | Whether to collect Service Profiler information from the server. The default value is False (not collected). The warmup phase does not collect server service information. |
| -pl/--profile-level | String | No | Collection level of the server's Service Profiler: Level_none, Level0, Level1, or Level2. Effective only when profiling is enabled. The default value is none. |
| -trace/--trace | Bool | No | Whether to enable the tool's trace switch, which monitors and displays the concurrency process. The default value is False (disabled). |
| -s/--benchmark-save-path | String | No | Folder where performance metrics are saved. The default value is ./benchmark_output. |
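
If the server runs with speculative decoding, the MTP acceptance rate can be derived by passing the matching settings to the tool. A sketch, assuming the server is configured with one speculative token:

$ acs-bench prof \
--provider ./provider/providers.yaml \
--dataset-type custom --input-path ./built_in_dataset/ \
--backend openai-chat \
--num-requests 8 --concurrency 8 \
--input-length 128 --output-length 128 \
--use-spec-decode --num-spec-tokens 1 \
--benchmark-save-path ./output_path/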

Artifact Description

After the script runs, a requests directory and a summary CSV file are created in the output path specified by --benchmark-save-path. The requests directory contains CSV files whose names start with requests. The file naming patterns are:

  • requests_{provider}_{dataset_type}_{control_method}_concurrency{concurrency}_{concurrency_backend}_input{input_length}_output{output_length}_{current_time}.csv
  • summary_{provider}_{control_method}_{concurrency_backend}_{current_time}.csv
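
As an illustration only, a finished run might leave files like the following (all names are hypothetical instances of the patterns above):

$ ls ./output_path/
requests/
summary_ascend-vllm_normal_threading-pool_20251104120000.csv
$ ls ./output_path/requests/
requests_ascend-vllm_custom_normal_concurrency1_threading-pool_input128_output128_20251104120000.csv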

Examples of the request details and summary files are shown in the figures below.

Figure 1 Request details
Figure 2 CSV file for performance metrics