Ascend-vLLM Inference FAQs
Issue 1: NPU Out of Memory Occurred During Inference and Prediction
Solution: Reduce the value of --gpu-memory-utilization when starting the inference service so that vLLM reserves a smaller fraction of NPU memory and leaves headroom for other processes.
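The same setting is available for offline inference through the gpu_memory_utilization argument of the LLM constructor. A minimal sketch, assuming the example model facebook/opt-125m and an illustrative value of 0.8 (the vLLM default is 0.9):
from vllm import LLM
# Reserve a smaller fraction of device memory for weights and the KV cache;
# 0.8 is only an example value, tune it for your model and hardware.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.8)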
Issue 2: ValueError: User-specified max_model_len is greater than the derived max_model_len Occurred During Inference and Prediction
Solution: Set the following environment variable before starting the inference service:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
This allows passing a value greater than the maximum sequence length specified in the model's config.json.
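For offline inference from Python, the variable can also be set inside the process before the engine is created. A minimal sketch, assuming the example model facebook/opt-125m and an illustrative max_model_len of 8192:
import os
# Must be set before vLLM derives and checks max_model_len.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
from vllm import LLM
# Request a context length larger than the one derived from the model's config.json.
llm = LLM(model="facebook/opt-125m", max_model_len=8192)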
Issue 3: Poor Performance or Accuracy Issues When Using Offline Inference
Solution: Set the block_size to 128.
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m", block_size=128)
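A short generation call on top of this object might look like the following; the prompt and sampling values are placeholders:
# Example sampling settings; adjust as needed.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)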