
Ascend-vLLM Inference FAQs

Issue 1: NPU Out of Memory Occurred During Inference and Prediction

Solution: Reduce the value of --gpu-memory-utilization when starting the inference service, so that vLLM reserves a smaller fraction of the NPU memory.
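
For example, the service can be launched with the fraction lowered below the default of 0.9 (a minimal sketch; the entrypoint, model path, and the value 0.8 are illustrative placeholders, not values prescribed by this FAQ):

python -m vllm.entrypoints.openai.api_server --model /path/to/model --gpu-memory-utilization 0.8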

Issue 2: ValueError: User-specified max_model_len is greater than the derived max_model_len Occurred During Inference and Prediction

Solution: Set the following environment variable before starting the inference service:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

This allows max_model_len to exceed the maximum sequence length specified in the model's config.json.
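
For example, to start the service with a context length larger than the one derived from config.json (a sketch; the entrypoint, model path, and length are illustrative placeholders):

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python -m vllm.entrypoints.openai.api_server --model /path/to/model --max-model-len 65536

Note that sequences longer than the context length the model was trained on may still produce degraded or incorrect outputs.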

Issue 3: Poor Performance or Accuracy Issues When Using Offline Inference

Solution: Set block_size to 128 when creating the LLM object:

from vllm import LLM, SamplingParams

# Setting block_size to 128 avoids the performance and accuracy degradation observed during offline inference.
llm = LLM(model="facebook/opt-125m", block_size=128)