
Ascend-vLLM Inference FAQs

Issue 1: NPU Out of Memory Occurred During Inference and Prediction

Solution: Reduce the value of --gpu-memory-utilization when starting the inference service, so that vLLM reserves a smaller fraction of the NPU memory.
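
For example, the service can be launched with the fraction lowered below the default of 0.9 (a minimal sketch; the entrypoint, model path, and the value 0.8 are illustrative placeholders, not values prescribed by this FAQ):

python -m vllm.entrypoints.openai.api_server --model /path/to/model --gpu-memory-utilization 0.8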

Issue 2: ValueError: User-specified max_model_len is greater than the derived max_model_len Occurred During Inference and Prediction

Solution: Set the following environment variable before starting the inference service:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

This allows max_model_len to exceed the maximum sequence length specified in the model's config.json.
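
For example, to start the service with a context length larger than the one derived from config.json (a sketch; the entrypoint, model path, and length are illustrative placeholders):

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python -m vllm.entrypoints.openai.api_server --model /path/to/model --max-model-len 65536

Note that sequences longer than the context length the model was trained on may still produce degraded or incorrect outputs.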

Issue 3: Poor Performance or Accuracy Issues When Using Offline Inference

Solution: Set block_size to 128 when creating the LLM object:

from vllm import LLM, SamplingParams

# Setting block_size to 128 avoids the performance and accuracy degradation observed during offline inference.
llm = LLM(model="facebook/opt-125m", block_size=128)