LLM Inference Performance Test
The acs-bench tool is required for this performance test. The acs-bench prof command runs an LLM performance benchmark; you can set the data length and request count to evaluate Ascend-vLLM service performance under various request loads. Both ramp-up tests and performance stress tests are supported.
Installing the acs-bench Tool
ModelArts 6.5.906 and later versions include the acs_bench-1.0.1-py3-none-any.whl package for the acs-bench tool pre-installed. No separate installation is required.
Check if the acs-bench tool is already installed:
$ pip show acs-bench
To install the acs-bench tool, follow these steps:
- Obtain the acs-bench tool's whl package. It is located in the llm_tools directory of the AscendCloud-LLM-xxx.zip software package. Install acs-bench into a Python runtime environment that can reach the inference service to be tested; performing this operation in the container where the inference service runs is recommended.
- (Optional) Configure the pip source according to your actual needs.
$ mkdir -p ~/.pip
$ vim ~/.pip/pip.conf
# Add the following content to the configuration file. The example below uses the Huawei source:
[global]
index-url = https://mirrors.tools.huawei.com/pypi/simple
trusted-host = mirrors.tools.huawei.com
timeout = 120
- (Optional) Install the acs-bench tool (not needed if the package is pre-installed):
$ pip install llm_tools/acs_bench-*-py3-none-any.whl
Preparations: Configuring providers.yaml
The acs-bench tool accesses the server through the providers.yaml configuration file, which contains information such as the server's id, name, api_key, base_url, model_name, and model_category.
Before using the acs-bench tool, create a providers.yaml file locally, fill in the parameter values according to your actual situation, and save it. The following is an example:
providers:
  - id: 'ascend-vllm'
    name: 'ascend-vllm'
    api_key: 'EMPTY'
    base_url: 'http://server_ip:port/v1'
    model_name: 'Qwen3-32b'
    model_category: 'Qwen3-32b'
Table 1 describes the parameters.

| Field | Mandatory | Description |
|---|---|---|
| id | No | Identifier for the service provider. |
| name | No | Name of the service provider. |
| api_key | No | Originally the api_key for OpenAI; it can now be used as the MaaS authentication code. |
| base_url | Yes | Base URL of the server, in the form http://{IP_address}:{port}/v1. |
| model_name | Yes | Model name used when starting the inference service. If the served-model-name parameter was set when starting the inference service, use its value; otherwise, use the model path used when starting the service. |
| model_category | No | Category of the model; can be omitted. |
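Before running a benchmark, you can sanity-check base_url and model_name. Assuming the service exposes the standard OpenAI-compatible /v1/models endpoint (as Ascend-vLLM does for /v1 services), a quick check might look like this; replace server_ip and port with your values:
$ curl http://server_ip:port/v1/models
# The returned model ids should match the model_name configured in providers.yaml.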
Obtaining Datasets
The acs-bench tool requires datasets for testing and currently supports open-source datasets in the LongBench and ShareGPT formats. If these datasets are not available locally, you can use the acs-bench generate dataset command to generate custom datasets. The following examples show how to use this command. For details about the parameters, see Dataset Generation Parameter Description.
- Generate a random dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type random \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Generate an embedding dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --task embedding \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Generate a reranking dataset.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --task rerank \
  --document-size 4 \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Filter datasets from LongBench.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type LongBench \
  --input-path ./dataset/long_bench \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- Filter datasets from ShareGPT.
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type ShareGPT \
  --input-path ./dataset/ShareGPT \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --num-requests 100
- LongBench download link: https://huggingface.co/datasets/zai-org/LongBench/tree/main
- ShareGPT download link: https://huggingface.co/datasets/shibing624/sharegpt_gpt4
The --input-length and --num-requests parameters of the dataset generation command accept only single values. To generate datasets with different specifications, change --input-length or --num-requests to the desired values and run the command again, as in the sketch below.
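Since each invocation accepts a single value, a small shell loop is one way to produce several specifications in one go (illustrative only; the lengths and paths are placeholders):
# Generate random datasets for several input lengths.
$ for LEN in 128 2048; do
    acs-bench generate dataset \
      --tokenizer ./tokenizer/Qwen3-32b \
      --dataset-type random \
      --output-path ./built_in_dataset \
      --input-length ${LEN} \
      --num-requests 100
  done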
Dataset Generation Parameter Description
The command to query the dataset generation parameters is as follows:
$ acs-bench generate dataset -h
| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -dt/--dataset-type | String | No | Source of the dataset to be generated, that is, the type of open-source dataset used for data filtering. The default value is random, which generates random token combinations. |
| -i/--input-path | String | No | Path to the open-source dataset used for data filtering. Not required when --dataset-type is random. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -tk/--task | String | No | Task backend for the dataset. The default value is generate. Options: generate, rerank, and embedding. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs; multiple pairs can be provided. Allowed keys: image_height, image_width, duration, and fps. |
| -o/--output-path | String | Yes | Output path for the generated JSON file containing prompts. |
| -il/--input-length | Int | Yes | Length of each prompt in the custom dataset. |
| -pl/--prefix-length | Int | No | Length of the common prefix prompt in the custom dataset, effective only in random mode. The default is 0. |
| -n/--num-requests | Int | Yes | Number of prompts to generate. |
| -ds/--document-size | Int | No | Number of documents per query. The default is 4. |
| -t/--tokenizer | String | Yes | Path to the tokenizer model folder; both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default is master. |
| -ra/--range-ratio-above | Float | No | Ratio by which the prompt length can dynamically increase. The maximum length is input_length x (1 + range_ratio_above). The value range is [0, 1]. The default is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the prompt length can dynamically decrease. The minimum length is input_length x (1 - range_ratio_below). The value range is [0, 1]. The default is 0.0. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default is False. |
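For instance, the range-ratio flags from the table let prompt lengths vary around the target: with --input-length 128 and both ratios set to 0.5, generated prompts would fall between 64 tokens (128 x 0.5) and 192 tokens (128 x 1.5). An illustrative command, reusing the random-dataset example above:
$ acs-bench generate dataset \
  --tokenizer ./tokenizer/Qwen3-32b \
  --dataset-type random \
  --output-path ./built_in_dataset \
  --input-length 128 \
  --range-ratio-above 0.5 \
  --range-ratio-below 0.5 \
  --num-requests 100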
Performance Stress Testing Mode Verification
# Use a thread pool for concurrent testing. The default concurrency backend is threading-pool (multi-thread);
# the asynchronous coroutine mode asyncio and the multi-process mode processing-pool are also available.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Ramp-Up Mode Verification
# Example using the multi-thread concurrency mode, starting at a concurrency of 1 and increasing by 2 every 5,000 ms.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --use-climb --climb-mode linear --growth-rate 2 --init-concurrency 1 --growth-interval 5000 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
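With the settings above in linear mode, and assuming the growth rate is added to the current concurrency at each interval, the effective concurrency would climb 1, 3, 5, 7, ... every 5,000 ms until it reaches the --concurrency ceiling of the current group.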
Concurrent Testing of Embedding Models Using an Embedding Dataset
An example of using the acs-bench prof command for concurrent testing of an embedding model with an embedding dataset is shown below. For details about parameter descriptions, see Parameter Descriptions for Usage Example. For details about output artifact descriptions, see Artifact Description.
# Example using the multi-thread concurrency mode, with the backend set to embedding.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend embedding --warmup 1 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Concurrent Testing of Reranking Models Using a Rerank Dataset
An example of using the acs-bench prof command for concurrent testing of a reranking model with a rerank dataset is shown below. For details about parameter descriptions, see Parameter Descriptions for Usage Example. For details about output artifact descriptions, see Artifact Description.
# Example using the multi-thread concurrency mode, with the backend set to rerank.
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend rerank --warmup 1 \
  --document-size 4,4,4,4 \
  --epochs 2 \
  --num-requests 1,2,4,8 --concurrency 1,2,4,8 \
  --input-length 128,128,2048,2048 --output-length 128,2048,128,2048 \
  --benchmark-save-path ./output_path/
Parameter Descriptions for Usage Example
1. The parameters --concurrency, --init-concurrency, --num-requests, --input-length, --output-length, and --config-option can be specified either as a comma-separated list (no spaces around the commas) or by repeating the flag once per value. If --num-requests is not specified, it defaults to the same value as --concurrency.
Example:
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --concurrency-backend threading-pool \
  --backend openai-chat --warmup 1 \
  --epochs 2 \
  --num-requests 1 --num-requests 2 --num-requests 4 --num-requests 8 \
  --concurrency 1 --concurrency 2 --concurrency 4 --concurrency 8 \
  --input-length 128 --input-length 128 --input-length 2048 --input-length 2048 \
  --output-length 128 --output-length 2048 --output-length 128 --output-length 2048 \
  --benchmark-save-path ./output_path/
2. The --input-length values used in the performance stress testing and ramp-up examples must exist in the pre-generated dataset. If they do not, see Obtaining Datasets to generate a dataset with the corresponding input length.
The performance benchmark parameters fall into four groups: dataset options, concurrency options, metrics options, and serving options. The tables below describe each group in turn.
$ acs-bench prof -h
Dataset Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -dt/--dataset-type | String | No | Type of dataset. The default value is custom, which means a user-defined dataset. |
| -cfg/--config-option | String | No | Multimodal configuration options, specified as "KEY:VALUE" pairs; multiple pairs can be provided, for example "KEY1:VALUE1,KEY2:VALUE2". Allowed keys: image_height, image_width, duration, and fps. |
| -mt/--modal-type | String | No | Modal type for multimodal datasets. The default value is text. Options: text, image-text, and video-text. |
| -i/--input-path | String | Yes | Path to the dataset. |
| -il/--input-length | Int | Yes | Input length of the custom dataset, effective only when --dataset-type is custom. Multiple input lengths can be specified, separated by ",". |
| -ds/--document-size | Int | No | Number of documents per query. Multiple integer values can be specified. |
| -n/--num-requests | Int | Yes | Number of requests for concurrent testing. Multiple values can be specified, separated by ",". Defaults to the concurrency value. |
| -t/--tokenizer | String | No | Path to the tokenizer model folder; both local paths and Hugging Face model paths are supported. |
| -rv/--revision | String | No | Model branch in the Hugging Face community, applicable only when the tokenizer is a Hugging Face model path. The default is master. |
| -seed/--random-seed | Int | No | Random seed used to fix randomness. |
| -trc/--trust-remote-code | Bool | No | Whether to trust remote code, applicable only when the tokenizer is a Hugging Face model path. The default is False. |
Concurrency Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -c/--concurrency | Int | No | Maximum concurrency level. The default value is 1. Multiple concurrency levels can be specified, separated by ",". |
| -nc/--num-process | Int | No | Number of processes for parallel processing; should be less than or equal to the number of CPUs. Multiple values can be specified, separated by ",". The default value is [1]. |
| -r/--request-rate | Float | No | Request arrival rate, effective only when concurrency is 1. The default value is infinity (INF). |
| -rm/--request-mode | String | No | Request arrival mode: normal or pd-adaptive. The default value is normal. |
| -pc/--prefill-concurrency | Int | No | Maximum concurrency for all prefill operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -dc/--decoder-concurrency | Int | No | Maximum concurrency for all decode operations in PD aggregation, effective only when --request-mode is pd-adaptive. |
| -burst/--burstiness | Float | No | Burst factor for requests, effective only when request_rate is not inf. The default value is 1.0. |
| -cb/--concurrency-backend | String | No | Concurrency backend. The default is threading-pool. Supported options: threading-pool (multi-thread), asyncio (asynchronous coroutine), and processing-pool (multi-process). |
| -ub/--use-climb | Bool | No | Whether to enable ramp-up mode. The default value is False (ramp-up mode disabled). |
| -gr/--growth-rate | Int | No | Concurrency growth per ramp-up step, effective only in ramp-up mode. The default value is 0. |
| -gi/--growth-interval | Float | No | Time interval between ramp-up steps, effective only in ramp-up mode. The default value is 1,000 ms. |
| -ic/--init-concurrency | Int | No | Initial concurrency level, effective only in ramp-up mode. Defaults to the concurrency value. Multiple initial concurrency levels can be specified, separated by ",". |
| -cm/--climb-mode | String | No | Ramp-up curve, effective only in ramp-up mode. The default value is linear. |
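For example, to emulate an open-loop arrival pattern rather than a fixed concurrency, the rate flags above could be combined with a concurrency of 1 (a sketch; the rate and request count are placeholders, and a burstiness of 1.0 conventionally corresponds to a Poisson arrival process):
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --backend openai-chat \
  --concurrency 1 --num-requests 100 \
  --request-rate 10 --burstiness 1.0 \
  --input-length 128 --output-length 128 \
  --benchmark-save-path ./output_path/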
Metrics Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -g/--goodput | String | No | Service SLO: performance thresholds that requests must meet to count toward goodput, in milliseconds (ms). Supported metric types: ttft, tpot, and e2el. For example, -g ttft:50 -g e2el:1000 specifies the ttft and e2el thresholds. |
| -bi/--bucket-interval | Float | No | Sampling interval for real-time performance metrics, in ms. If specified, changes in performance metrics are monitored dynamically within each bucket_interval window. |
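Combining the goodput flags with the stress-test command above might look like this (a sketch; the thresholds are placeholder values):
$ acs-bench prof \
  --provider ./provider/providers.yaml \
  --dataset-type custom --input-path ./built_in_dataset/ \
  --backend openai-chat \
  --concurrency 4 --num-requests 4 \
  --input-length 128 --output-length 128 \
  -g ttft:50 -g e2el:1000 \
  --benchmark-save-path ./output_path/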
Serving Options

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| -b/--backend | String | No | Type of request service API: openai, openai-chat, embedding, or rerank. The default value is openai-chat. |
| -p/--provider | String | Yes | Path to the provider file, which you must create and specify. |
| -pid/--provider-id | String | No | Provider ID to be tested; useful when the provider file contains multiple configurations and only a specific one needs to run. |
| -ol/--output-length | Int | Yes | Length of output tokens. Multiple output lengths can be specified, separated by ",". |
| -ra/--range-ratio-above | Float | No | Ratio by which the output token length can dynamically increase. The maximum length is output_length x (1 + range_ratio_above). The value range is [0, 1]. The default is 0.0. |
| -rb/--range-ratio-below | Float | No | Ratio by which the output token length can dynamically decrease. The minimum length is output_length x (1 - range_ratio_below). The value range is [0, 1]. The default is 0.0. |
| -w/--warmup | Int | No | Number of warmup requests. The default value is 0 (warmup disabled). |
| -e/--epochs | Int | No | Number of times to run the same concurrency configuration. The default value is 1 (each concurrency group runs once). |
| -tk/--top-k | Int | No | Top-k sampling parameter, effective only for OpenAI-compatible backends. The default value is -1. |
| -tp/--top-p | Float | No | Top-p sampling parameter, effective only for OpenAI-compatible backends. The default value is 1.0. |
| -mp/--min-p | Float | No | Min-p sampling parameter: the minimum probability for a token to be considered. Must be in the range [0, 1]; effective only for OpenAI-compatible backends. |
| -temper/--temperature | Float | No | Temperature sampling parameter. The default value is 0. |
| -cs/--chunk-size | Int | No | Chunk size in stream requests. The default value is 1024. |
| -ef/--encoding-format | String | No | Encoding format of the backend's response: float or base64. The default value is float. |
| -usd/--use-spec-decode | Bool | No | Whether speculative inference is enabled on the server. If enabled, it can be combined with --num-spec-tokens to calculate the MTP acceptance rate. The default value is False (speculative inference disabled). |
| -nst/--num-spec-tokens | Int | No | Number of speculative inference tokens configured on the server. A value of 1 means the server infers one additional token each time; can be combined with --use-spec-decode to calculate the MTP acceptance rate. The default value is -1. |
| -umar/--use-mtp-accept-rate | Bool | No | Whether to ignore the number of tokens generated by the model when calculating the MTP acceptance rate. The default value is True (generated tokens are ignored). |
| -nss/--num-scheduler-steps | Int | No | Multi-step size on the server, used to calculate the MTP acceptance rate. The default value is 1. |
| -timeout/--timeout | Float | No | Request timeout. The default value is 1,000 s. |
| -ie/--ignore-eos | Bool | No | Whether to ignore EOS. The default value is True (EOS is ignored). |
| -cus/--continuous-usage-stats | Bool | No | Whether to include usage information in each returned chunk of stream requests. The default value is True (usage information is included in each chunk). |
| -sst/--skip-special-tokens | Bool | No | Whether to skip special tokens. The default value is False. |
| -er/--enable-max-tokens-exclude-reasoning | Bool | No | Whether to exclude reasoning tokens when counting max-tokens; the client proactively disconnects from the server when max-tokens is reached. The default value is True. |
| -pf/--profile | Bool | No | Whether to collect Service Profiler information from the server. The default value is False (not collected). The warmup phase does not collect server information. |
| -pl/--profile-level | String | No | Collection level of the server's Service Profiler: Level_none, Level0, Level1, or Level2. Effective only when profiling is enabled. The default value is Level_none. |
| -trace/--trace | Bool | No | Whether to enable the tool's trace switch, which monitors and displays the concurrency process. The default value is False (disabled). |
| -s/--benchmark-save-path | String | No | Path to the folder where performance metrics are saved. The default value is ./benchmark_output. |
Artifact Description
After the script runs, a requests directory and a summary CSV file are created in the output path specified by the --benchmark-save-path parameter. The requests directory contains per-request CSV files whose names start with requests. The file names follow these patterns:
1. requests_{provider}_{dataset_type}_{control_method}_concurrency{concurrency}_{concurrency_backend}_input{input_length}_output{output_length}_{current_time}.csv
2. summary_{provider}_{control_method}_{concurrency_backend}_{current_time}.csv
A hypothetical illustration of the resulting layout, with placeholder values substituted into the name fields, is shown below.
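output_path/
├── requests/
│   └── requests_ascend-vllm_custom_normal_concurrency4_threading-pool_input128_output128_20250101120000.csv
└── summary_ascend-vllm_normal_threading-pool_20250101120000.csv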