
Starting an LLM-powered Inference Service

This topic describes how to start an LLM-powered inference service, including offline inference and online inference.

Both offline and online inference require the following common framework environment variables:

# VPC CIDR block
# Change the value manually to match your VPC CIDR block.
VPC_CIDR="7.150.0.0/16"  
VPC_PREFIX=$(echo "$VPC_CIDR" | cut -d'/' -f1 | cut -d'.' -f1-2)
POD_INET_IP=$(ifconfig | grep -oP "(?<=inet\s)$VPC_PREFIX\.\d+\.\d+" | head -n 1)
POD_NETWORK_IFNAME=$(ifconfig | grep -B 1 "$POD_INET_IP" | head -n 1 | awk '{print $1}' | sed 's/://')
echo "POD_INET_IP: $POD_INET_IP"
echo "POD_NETWORK_IFNAME: $POD_NETWORK_IFNAME" 
# Specify the NIC.
export GLOO_SOCKET_IFNAME=$POD_NETWORK_IFNAME
export TP_SOCKET_IFNAME=$POD_NETWORK_IFNAME
export HCCL_SOCKET_IFNAME=$POD_NETWORK_IFNAME
# Set this parameter in the multi-node scenario.
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1

# Enable video memory optimization.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Configure the expansion location of the communication algorithm to the AI Vector Core unit on the device side.
export HCCL_OP_EXPANSION_MODE=AIV
# Specify available PUs as required.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 
# Specify CPU core binding as needed.
export CPU_AFFINITY_CONF=1
export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2:${LD_PRELOAD}
# ascend-turbo-graph is enabled by default. Set the plug-in to ascend_vllm; otherwise, the ascend plug-in is used.
export VLLM_PLUGINS=ascend_vllm
# Specify whether to use ACLGraph mode: set the parameter to 1 to enable, and 0 to disable.
export USE_ACLGRAPH=0
# Set the vLLM backend version to v1.
export VLLM_USE_V1=1
# Specify the vLLM version.
export VLLM_VERSION=0.9.0

Offline Inference

Create a Python script containing the following code and execute it to run offline model inference using ascend-vllm.

from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    model_path = "/path/to/model"
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        block_size=128,
        max_num_seqs=256,
        max_model_len=8192,
        additional_config={
            "ascend_turbo_graph_config": {
                "enabled": True
            },
            "ascend_scheduler_config": {
                "enable": True,
                "chunked_prefill_enabled": False
            }
        },
        distributed_executor_backend='ray'
    )

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
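
A minimal run sketch follows, assuming the script above is saved as offline_infer.py (a hypothetical file name) and that /path/to/model has been replaced with the actual model weight path:

# Set the common framework environment variables described above before running.
# offline_infer.py is a hypothetical name for the script shown above.
python offline_infer.py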

Starting a Real-Time Inference Service

This section uses the OpenAI service API. For details, see https://docs.vllm.ai/en/latest/getting_started/quickstart.html.

Use the OpenAI API for inference. The following commands apply to a single node with one or more PUs. Adjust the settings using the parameters described below.

In addition to the common framework environment variables, the table below lists the key performance optimization settings for Qwen series models.

Qwen2, Qwen2.5, and Qwen3 LLMs

  • The Qwen MoE model cannot use the Qwen series optimization settings listed in Table 1.
  • The ACLGraph and eager modes cannot use the Qwen series optimization settings listed in Table 1.
  • The Qwen series W4A16 quantization models work exclusively with the AscendTurbo graph mode and cannot be configured using Qwen series optimization environment variables.
  1. For Qwen2, Qwen2.5, or Qwen3 models, use the ascend-turbo-graph mode as the default setting in the inference service startup parameters. You must also configure specific environment variables for Qwen models.
  2. Use eager mode for Meta-Llama or Llama-like models.
Table 1 Environment variables for starting the Qwen Dense series

  • ENABLE_QWEN_HYPERDRIVE_OPT: Disabled by default. Enables the flashcomm communication optimization combined with the general fused operator optimization (TDynamicquant), which applies to the entire Qwen model series. In BF16 scenarios, it must be used together with DISABLE_QWEN_DP_PROJ. W8A8 is not affected.
  • ENABLE_QWEN_MICROBATCH: Disabled by default. Enables micro-batch optimization and must be used together with ENABLE_QWEN_HYPERDRIVE_OPT. Applicable to all Qwen models in W8A8 and BF16 modes.
  • ENABLE_PHASE_AWARE_QKVO_QUANT: Disabled by default. At runtime, it introduces BF16 weights and performs mixed-precision quantized inference, which may increase memory usage. Applicable to all Qwen models in W8A8 mode; not supported in BF16 mode. Must be used together with ENABLE_QWEN_HYPERDRIVE_OPT.
  • DISABLE_QWEN_DP_PROJ: Disabled by default. Takes effect only when export ENABLE_QWEN_HYPERDRIVE_OPT=1 is set. Disables full weight loading for mlp down_proj. In Qwen BF16 scenarios, setting it to 1 (disabling the weight copy) is recommended; in W8A8 scenarios, it can remain 0.

Recommended environment variable settings for supported Qwen Dense models:

  • Qwen2, Qwen2.5, and Qwen3 Dense series – BF16

    export ENABLE_QWEN_HYPERDRIVE_OPT=1
    export ENABLE_QWEN_MICROBATCH=1
    export DISABLE_QWEN_DP_PROJ=1

  • Qwen2, Qwen2.5, and Qwen3 Dense series – W8A8

    export ENABLE_QWEN_HYPERDRIVE_OPT=1
    export ENABLE_QWEN_MICROBATCH=1
    export ENABLE_PHASE_AWARE_QKVO_QUANT=0
    export DISABLE_QWEN_DP_PROJ=0

    Remarks:
    • Set export ENABLE_PHASE_AWARE_QKVO_QUANT=1 only for Qwen3-32b-w8a8 tp8.
    • When BF16 is used, set export DISABLE_QWEN_DP_PROJ=1 to disable weight copy optimization.
    • When W8A8 is used, the default setting is DISABLE_QWEN_DP_PROJ=0. If the server log shows GPU KV cache usage staying above 90% and a sharp rise in TTFT, disable weight copy optimization by setting export DISABLE_QWEN_DP_PROJ=1.

Inference service startup parameters:
source /home/ma-user/AscendCloud/AscendTurbo/set_env.bash

python -m vllm.entrypoints.openai.api_server \
--model ${container_model_path} \
--max-num-seqs=256 \
--max-model-len=4096 \
--max-num-batched-tokens=4096 \
--tensor-parallel-size=1 \
--block-size=128 \
--host=${docker_ip} \
--port=8080 \
--gpu-memory-utilization=0.95 \
--trust-remote-code \
--no-enable-prefix-caching \
--additional-config='{"ascend_turbo_graph_config": {"enabled": true}, "ascend_scheduler_config": {"enabled": true}}'
The basic inference service parameters are as follows:
  • --model ${container_model_path}: path to the model weights within the container, in the Hugging Face directory format that stores the model's weight files. If quantization is used, the weights converted in Quantization are used. If the weights were converted from a trained model into the Hugging Face format, the original tokenizer file is required.
  • --quantization, -q: weight quantization method taken from quantization_config in the model's configuration file. This parameter is required if the model uses quantization.
  • --max-num-seqs: maximum number of concurrent requests that can be processed. Requests exceeding this limit will wait in the queue.
  • --max-model-len: maximum number of input and output tokens during inference. The value of max-model-len must be less than that of seq_length in the config.json file. Otherwise, an error will be reported during inference and prediction. The config.json file is stored in the path corresponding to the model, for example, ${container_model_path}/chatglm3-6b/config.json. The maximum length varies among different models. For details, see Table 1.
  • --max-num-batched-tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of --max-model-len. The value 4096 or 8192 is recommended.
  • --dtype: data type for model inference, which can be FP16 or BF16. float16 indicates FP16, and bfloat16 indicates BF16. If unspecified, the data type is auto-matched based on input data. Using different dtype may affect model precision. When using open-source weights, you are advised not to specify dtype and instead use the default dtype of the open-source weights.
  • --tensor-parallel-size: number of PUs you want to use. The product of model parallelism and pipeline parallelism must be the same as the number of started NPUs. For details, see Table 1. The value 1 indicates that the service is started using a single PU.
  • --pipeline-parallel-size: number of parallel pipelines. The product of model parallelism and pipeline parallelism must be the same as the number of started NPUs. The default value is 1. Currently, pipeline-parallel-size can only be set to 1.
  • --block-size: block size of kv-cache. The recommended value is 128.
  • --host=${docker_ip}: IP address for service deployment. Replace ${docker_ip} with the actual IP address of the host machine. The default value is None. For example, the parameter can be set to 0.0.0.0.
  • --port: port where the service is deployed.
  • --gpu-memory-utilization: ratio of the GPU memory used by the NPU. The input parameter name of the original vLLM is reused. The recommended value is 0.95.
  • --trust-remote-code: Specifies whether to trust the remote code.
  • --distributed-executor-backend: backend for launching multi-PU inference. The options are ray and mp, where ray indicates that Ray is used for multi-PU inference, and mp indicates that Python multi-processing is used for multi-PU inference. The default value is mp.
  • --disable-async-output-proc: disables asynchronous output processing. Disabling it reduces performance.
  • --speculative-config: speculative inference parameter, which is a JSON string. The default value is None.
  • --no-enable-prefix-caching: Disables prefix caching. For details about how to enable prefix caching, see Prefix Caching.
  • --enforce-eager: If the INFER_MODE environment variable is not set, some models are started in ACLGraph mode by default to improve performance. After this parameter is set, the graph mode is disabled. You are advised to enable this function for non-Qwen series, such as Meta-Llama series.
  • --additional-config: {"ascend_turbo_graph_config": {"enabled": true}} enables the ascend_turbo graph mode. If this function is enabled, the performance of the Qwen series is improved. If this function is disabled, acl_graph is used by default. Currently, acl_graph supports only BF16 and does not support compress_tensors, smoothquant, or awq.

    "ascend_scheduler_config": {"enabled": true} is the configuration option of the scheduler.

Deploying an Inference Service on Multiple Nodes

If the GPU memory of a single node cannot hold the model weights, use multi-node deployment. The nodes must be in the same cluster, and the IP addresses of the NPUs must be reachable (pingable) from each other. The steps are as follows:

  1. Run the following command on one of the nodes to view the IP address of the NPUs:
    for i in $(seq 0 7);do hccn_tool -i $i -ip -g;done
  2. Check whether the network connection between NPUs is normal.
    # Run the following command on another node. 29.81.3.172 is the value of ipaddr obtained in the previous step.
    hccn_tool -i 0 -ping -g address 29.81.3.172
  3. Start the Ray cluster.
    # Specify the communication NIC. Run the ifconfig command to find the NIC name that matches the host IP address.
    export GLOO_SOCKET_IFNAME=enp67s0f5
    export TP_SOCKET_IFNAME=enp67s0f5
    export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
    
    # Specify the available NPUs.
    export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    
    # Set one node as the head node.
    ray start --head --num-gpus=8
    
    # Run the following command on other nodes:
    ray start --address='10.170.22.18:6379' --num-gpus=8
    • --num-gpus: The value must be the same as the number of available NPUs specified by ASCEND_RT_VISIBLE_DEVICES.
    • --address: IP address and port number of the head node. This will be printed after the head node is successfully created.
    • Environment variables must be set on each node.
    • To update environment variables, you must restart the Ray cluster.
  4. Choose a node, add the distributed backend setting --distributed-executor-backend=ray, and keep all other settings the same as in a standard service launch. For details, see the single-node scenario in Starting a Real-Time Inference Service. A minimal launch sketch is shown below.
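    The sketch assumes two 8-PU nodes (tensor-parallel-size=16) and reuses the single-node parameters described above; the model path, IP address, port, and parallel size are placeholders to adapt to your deployment, not a verified configuration.
    source /home/ma-user/AscendCloud/AscendTurbo/set_env.bash
    python -m vllm.entrypoints.openai.api_server \
    --model ${container_model_path} \
    --max-num-seqs=256 \
    --max-model-len=4096 \
    --max-num-batched-tokens=4096 \
    --tensor-parallel-size=16 \
    --block-size=128 \
    --host=${docker_ip} \
    --port=8080 \
    --gpu-memory-utilization=0.95 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --distributed-executor-backend=ray \
    --additional-config='{"ascend_turbo_graph_config": {"enabled": true}, "ascend_scheduler_config": {"enabled": true}}'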

Inference Request Test

Run the following commands to check whether the inference service has started properly. For details about the parameter settings in the service startup command, see Starting a Real-Time Inference Service.

The following inference test commands apply to a service started through the OpenAI service API. Replace ${docker_ip} with the actual IP address of the host machine. If the served-model-name parameter was not specified when the service was started, ${container_model_path} must match the value of the model parameter. If served-model-name was specified, use its value in place of ${container_model_path}.

OpenAI Completions API with vLLM
curl http://${docker_ip}:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{        
      "model": "${container_model_path}",      
      "prompt": "hello",
      "max_tokens": 32,
      "temperature": 0   
}'

OpenAI Chat Completions API with vLLM
curl -X POST http://${docker_ip}:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "${container_model_path}",
    "messages": [
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "max_tokens": 32,
    "temperature": 0
}'

The service APIs are identical to those on the vLLM official website. Key parameters are introduced here. For detailed parameter descriptions, refer to the official website at https://docs.vllm.ai/en/stable/api/vllm/vllm.sampling_params.html.

For details about the OpenAI service request parameters, see Table 2.

Table 2 OpenAI service request parameters

Each entry lists the parameter, whether it is mandatory, its type, its default value, and a description.

  • model (mandatory; Str; default: None): Mandatory in an inference request when the service is started through the OpenAI service API. The value must be the same as the model value (${container_model_path}) used when the inference service was started. This parameter is not involved when the service is started through the vLLM service API.
  • prompt (mandatory; Str): Input question of the request.
  • max_tokens (optional; Int; default: 16): Maximum number of tokens to be generated for each output sequence.
  • top_k (optional; Int; default: -1): Determines how many of the highest-ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time.
  • top_p (optional; Float; default: 1.0): A floating-point number that controls the cumulative probability of the top tokens to be considered. The value must be in the range (0, 1]. The value 1 indicates that all tokens are considered.
  • temperature (optional; Float; default: 1.0): A floating-point number that controls the randomness of sampling. Lower values make the model more deterministic, while higher values make it more random. The value 0 indicates greedy sampling.
  • stop (optional; None/Str/List; default: None): A list of strings used to stop generation. The output does not contain the stop strings. For example, if "Hello" and "Hi" are configured as stop sequences, the model stops generating text when "Hello" or "Hi" is encountered.
  • stream (optional; Bool; default: False): Controls whether to enable streaming inference. The default value False disables streaming inference.
  • n (optional; Int; default: 1): Number of results returned for the request.

    Constraints: If beam_search is not used, the recommended range is 1 ≤ n ≤ 10; when n exceeds 1, avoid greedy sampling by setting top_k greater than 1 and keeping temperature above 0. If beam_search is used, the recommended range is 1 < n ≤ 10; if n is 1, the inference request will fail.

    NOTE: For optimal performance, keep n at 10 or below. Large values of n can significantly slow down processing, and insufficient video memory may cause inference requests to fail.
  • use_beam_search (optional; Bool; default: False): Controls whether to use beam_search instead of sampling.

    Constraints: When this parameter is used, the following parameters must be set as required: n > 1, top_p = 1.0, top_k = -1, temperature = 0.0.

    WARNING: When beam_search is used, max_tokens must be explicitly set to ensure that the request stops as expected.
  • presence_penalty (optional; Float; default: 0.0): Applies rewards or penalties based on the presence of new words in the generated text. The value range is [-2.0, 2.0].
  • frequency_penalty (optional; Float; default: 0.0): Applies rewards or penalties based on the frequency of each word in the generated text. The value range is [-2.0, 2.0].
  • length_penalty (optional; Float; default: 1.0): Imposes a larger penalty on longer sequences during beam search. To use length_penalty, the following parameters must also be set: "use_beam_search": true, "best_of": 2 (a value greater than 1), and "top_k": -1.
  • ignore_eos (optional; Bool; default: False): Indicates whether to ignore EOS and continue to generate tokens.
  • guided_json (optional; Union[str, dict, BaseModel]; default: None): When the service is started through the OpenAI API and JSON Schema output is required, set guided_json. For details, see Structured Outputs.
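
For reference, the following sample Completions request combines several of the parameters in Table 2. The values are illustrative only; replace ${docker_ip} and ${container_model_path} as described in Inference Request Test, and note that n greater than 1 requires top_k greater than 1 and temperature above 0.

curl http://${docker_ip}:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "${container_model_path}",
    "prompt": "hello",
    "max_tokens": 32,
    "n": 2,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "stop": ["Hello", "Hi"],
    "stream": false,
    "presence_penalty": 0.5,
    "ignore_eos": false
}'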