Starting an Embedding or Reranking Model-powered Inference Service
This topic describes how to start an embedding or reranking model-powered inference service, including offline inference and online inference.
Set the following common framework environment variables for both offline and online inference:
# VPC CIDR block
# You need to manually change the values. For details, see the following instructions.
VPC_CIDR="7.150.0.0/16"
VPC_PREFIX=$(echo "$VPC_CIDR" | cut -d'/' -f1 | cut -d'.' -f1-2)
POD_INET_IP=$(ifconfig | grep -oP "(?<=inet\s)$VPC_PREFIX\.\d+\.\d+" | head -n 1)
POD_NETWORK_IFNAME=$(ifconfig | grep -B 1 "$POD_INET_IP" | head -n 1 | awk '{print $1}' | sed 's/://')
echo "POD_INET_IP: $POD_INET_IP"
echo "POD_NETWORK_IFNAME: $POD_NETWORK_IFNAME"
# Specify the NIC.
export GLOO_SOCKET_IFNAME=$POD_NETWORK_IFNAME
export TP_SOCKET_IFNAME=$POD_NETWORK_IFNAME
export HCCL_SOCKET_IFNAME=$POD_NETWORK_IFNAME
# Set this parameter in the multi-node scenario.
export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
# Enable video memory optimization.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
# Configure the expansion location of the communication algorithm to the AI Vector Core unit on the device side.
export HCCL_OP_EXPANSION_MODE=AIV
# Specify available PUs as required.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Specify CPU core binding as needed.
export CPU_AFFINITY_CONF=1
export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2:${LD_PRELOAD}
# By default, the ACLGraph mode is enabled. Specify the plug-in as ascend.
export VLLM_PLUGINS=ascend
# Specify whether to use ACLGraph mode: set the parameter to 1 to enable, and 0 to disable.
export USE_ACLGRAPH=1
# Set the vLLM backend version to v0.
export VLLM_USE_V1=0
# Specify the vLLM version.
export VLLM_VERSION=0.9.0
Offline Inference
Create a Python script with the following code and run it to perform offline embedding inference using ascend-vllm.
from vllm import LLM

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    model_path = "/path/to/model"
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        block_size=128,
        max_num_seqs=256,
        max_model_len=8192,
        distributed_executor_backend='ray'
    )
    outputs = llm.encode(prompts)
    # Print the outputs.
    for output in outputs:
        print(output.outputs.embedding)

if __name__ == "__main__":
    main()
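For an offline reranking model, the flow is similar, except that the model is a cross-encoder and relevance scores are returned instead of embeddings. The following is a minimal sketch, assuming a cross-encoder reranker (for example, a bge-reranker-style model) and that your vLLM version provides LLM.score; the model path and parallel settings are placeholders to adjust for your deployment.
from vllm import LLM

def main():
    # Placeholder path; replace with the actual reranker weights in the container.
    model_path = "/path/to/reranker-model"
    query = "What is the capital of France?"
    documents = [
        "The capital of France is Paris",
        "Reranking is fun!",
    ]
    # Depending on the vLLM version, you may need to pass task="score" to LLM().
    llm = LLM(
        model=model_path,
        tensor_parallel_size=1,
        block_size=128,
        max_num_seqs=256,
        max_model_len=8192,
    )
    # Score the query against each document; a higher score means higher relevance.
    outputs = llm.score(query, documents)
    for output in outputs:
        print(output.outputs.score)

if __name__ == "__main__":
    main()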
Starting a Real-Time Inference Service
This section uses the OpenAI service API. For details, see https://docs.vllm.ai/en/latest/getting_started/quickstart.html.
Use the OpenAI-compatible API for inference. The following commands start the service on a single node with one or more PUs. Adjust the settings using the parameters described below.
source /home/ma-user/AscendCloud/AscendTurbo/set_env.bash
python -m vllm.entrypoints.openai.api_server \
--model /model/Qwen3-Reranker-0.6B \
--served-model-name Qwen3-Reranker-0.6B \
--max-num-seqs=256 \
--max-model-len=32768 \
--max-num-batched-tokens=32768 \
--tensor-parallel-size=1 \
--block-size=128 \
--host=0.0.0.0 \
--port=9001 \
--gpu-memory-utilization=0.95 \
--trust-remote-code
# hf_overrides needs to be set only for Qwen3 reranker.
# --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
- --model ${container_model_path}: path to the model weights in the container, in the Hugging Face directory format that stores the model's weight files.
- --max-num-seqs: maximum number of concurrent requests that can be processed. Requests exceeding this limit will wait in the queue.
- --max-model-len: maximum number of input and output tokens during inference. The value of max-model-len must be less than that of seq_length in the config.json file. Otherwise, an error will be reported during inference and prediction. The config.json file is stored in the path corresponding to the model, for example, ${container_model_path}/chatglm3-6b/config.json. The maximum length varies among different models. For details, see Table 1.
- --max-num-batched-tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of --max-model-len. The value 4096 or 8192 is recommended.
- --dtype: data type for model inference, which can be FP16 or BF16. float16 indicates FP16, and bfloat16 indicates BF16. If unspecified, the data type is matched automatically based on the model weights. Using a different dtype may affect model precision. When using open-source weights, you are advised not to specify dtype and instead use their default dtype.
- --tensor-parallel-size: number of PUs to use. The product of tensor parallelism and pipeline parallelism must equal the number of started NPUs. For details, see Table 1. The value 1 indicates that the service is started using a single PU.
- --pipeline-parallel-size: number of parallel pipelines. The product of tensor parallelism and pipeline parallelism must equal the number of started NPUs. The default value is 1. Currently, pipeline-parallel-size can only be set to 1.
- --block-size: block size of kv-cache. The recommended value is 128.
- --host=${docker_ip}: IP address for service deployment. Replace ${docker_ip} with the actual IP address of the host machine. The default value is None. For example, the parameter can be set to 0.0.0.0.
- --port: port where the service is deployed.
- --gpu-memory-utilization: fraction of NPU memory to use. The parameter name of the original vLLM is reused. The recommended value is 0.95.
- --trust-remote-code: whether to trust remote code shipped with the model.
- --distributed-executor-backend: backend for launching multi-PU inference. The options are ray and mp, where ray indicates that Ray is used for multi-PU inference, and mp indicates that Python multi-processing is used for multi-PU inference. The default value is mp.
- --disable-async-output-proc: disables asynchronous output processing. Disabling it reduces performance.
- --no-enable-prefix-caching: Disables prefix caching. For details about how to enable prefix caching, see Prefix Caching.
- --enforce-eager: if the INFER_MODE environment variable is not set, some models are started in ACLGraph mode by default to improve performance. Setting this parameter disables the graph mode. You are advised to set it for non-Qwen models, such as the Meta-Llama series.
- --hf_overrides: Overrides the default settings of the Hugging Face model within the vLLM inference engine. Currently, only Qwen3 rerankers need to override some parameters (see the startup description).
Inference Request Test
Run the following commands to check whether the inference service has started properly. For details about the parameter settings in the service startup command, see Starting a Real-Time Inference Service.
Start the service through the OpenAI-compatible API and run the inference test commands below. Replace ${docker_ip} with the actual IP address of the host machine and use the port specified at service startup. The model parameter is mandatory: if the service was started with served_model_name, the value must match it; otherwise, the value must be the same as the service's model parameter, that is, the path to the model weights in the container.
Example reranking API:
curl -X POST http://${docker_ip}:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "/container_model/bge-reranker-v2-m3",
"query": "What is the capital of France?",
"documents": [
"The capital of France is Paris",
"Reranking is fun!",
"vLLM is an open-source framework for fast AI serving"
]
}'
| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| model | Yes | None | Str | If the service was started with served_model_name, the value must match it. Otherwise, the value is the same as the model parameter of the service. |
| query | Yes | None | Str | User query text. |
| documents | Yes | None | Str | List of documents to be reranked (usually the top-k results of embedding recall). |
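The same reranking request can also be sent from Python. The following is a minimal sketch using the requests library; the base URL, port, and model path follow the curl example above and must be replaced with the values of your actual deployment.
import requests

# Replace with the actual IP address and port of the deployed service.
BASE_URL = "http://127.0.0.1:8080"

payload = {
    "model": "/container_model/bge-reranker-v2-m3",
    "query": "What is the capital of France?",
    "documents": [
        "The capital of France is Paris",
        "Reranking is fun!",
        "vLLM is an open-source framework for fast AI serving",
    ],
}

response = requests.post(f"{BASE_URL}/v1/rerank", json=payload, timeout=60)
response.raise_for_status()
# The response contains a relevance score for each document.
print(response.json())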
Start the service using the OpenAI-compatible API (only the v0 backend is supported). The following is an example of the embeddings API:
curl -X POST http://${docker_ip}:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "/container_model/bge-base-en-v1.5",
"input":"I love shanghai"
}'
| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| model | Yes | None | Str | If the service was started with served_model_name, the value must match it. Otherwise, the value is the same as the model parameter of the service. |
| input | Yes | None | Str | Text to be embedded. Strings are supported. |
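An equivalent embeddings request can be sent from Python as well. The following is a minimal sketch using the requests library; the base URL, port, and model path follow the curl example above and are placeholders for your actual deployment.
import requests

# Replace with the actual IP address and port of the deployed service.
BASE_URL = "http://127.0.0.1:8080"

payload = {
    "model": "/container_model/bge-base-en-v1.5",
    "input": "I love shanghai",
}

response = requests.post(f"{BASE_URL}/v1/embeddings", json=payload, timeout=60)
response.raise_for_status()
result = response.json()
# In the OpenAI-compatible response format, the vector is under data[0].embedding.
print(result["data"][0]["embedding"])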