Introduction to Ascend-vLLM
Overview
vLLM is a well-known GPU-based framework for foundation model inference. Its popularity stems from features such as continuous batching and PagedAttention. Additionally, vLLM supports speculative decoding and automatic prefix caching, making it valuable for both academic and industrial applications.
Ascend-vLLM is an inference framework optimized for NPUs. It inherits the advantages of vLLM while boosting performance and usability through NPU-specific optimizations, making it more efficient and convenient to run foundation models on NPUs. Ascend-vLLM can be applied to a wide range of foundation model inference tasks, particularly in scenarios requiring high performance and efficiency, such as natural language processing and multimodal understanding.
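Because Ascend-vLLM inherits the vLLM programming interface, a minimal offline-inference sketch using the standard open-source vLLM Python API illustrates the workflow. The model name, prompts, and sampling settings below are placeholders, and availability of this exact API on a given Ascend-vLLM build is an assumption rather than documented behavior.

```python
# Minimal offline-inference sketch using the standard vLLM Python API.
# Model name and sampling settings are placeholders, not recommendations.
from vllm import LLM, SamplingParams

prompts = [
    "What is foundation model inference?",
    "Summarize the benefits of continuous batching.",
]

# Sampling configuration; adjust temperature/top_p/max_tokens as needed.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# Loads the model and allocates the paged KV cache.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Continuous batching schedules all prompts together for higher throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```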
Highlights
- Ease of use: Ascend-vLLM makes deploying and running foundation models simpler for developers.
- Easy development: An intuitive interface simplifies model development and debugging, so adjustments and optimizations are straightforward.
- High performance: NPU-specific features and optimizations, such as PD aggregation and optimized preprocessing, postprocessing, and sampling, boost inference efficiency.
Supported Features
| Category | Feature | Description |
|---|---|---|
| Scheduling | Page-attention | Manages the KV cache in blocks to improve throughput. |
| Scheduling | Continuous batching | Iterative scheduling and dynamic batch adjustment reduce latency and improve throughput. |
| Quantization | W4A16-AWQ | INT4 weight quantization reduces memory usage and speeds up processing, boosting performance for low-concurrency tasks by 80% with accuracy loss under 2%. |
| Quantization | W8A8-SmoothQuant | INT8 quantization of weights and activations reduces memory usage, boosts throughput by 30%, and keeps precision loss under 1.5%. |
| Efficient decoding | Auto-prefix-caching | Caching the prompt prefix speeds up first-token generation, which helps significantly with lengthy system prompts or multi-turn conversations. |
| Efficient decoding | Chunked-prefill | Also called SplitFuse; it improves resource efficiency and performance by combining full (prefill) and incremental (decode) inference. |
| Efficient decoding | Speculative Decoding | Boosts inference speed by supporting both large-and-small-model speculation and eager-mode speculative execution. |
| Graph mode | ascend-turbo-graph | Traces operator dependencies during execution, removes Python host latency, and handles dynamic shapes. If no setting is provided, acl_graph is used by default; if unsupported, it falls back to eager mode. |
| Graph mode | acl-graph | Delivers performance comparable to cuda-graph's piece-wise graph. When eager mode is enabled, it takes precedence. |
| Output control | Guided Decoding | Constrains model outputs to a specified format. |
| Output control | Beam search | Outputs multiple candidate results through beam search. |
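As a hedged illustration, the sketch below shows how two of the features above, auto-prefix-caching and chunked-prefill, are typically switched on through vLLM-style engine arguments. The argument names follow the open-source vLLM API and may vary across versions; the model name is a placeholder, and support for these exact arguments in Ascend-vLLM is an assumption.

```python
# Sketch: enabling auto-prefix-caching and chunked-prefill via vLLM engine arguments.
# Argument names follow the open-source vLLM API and may differ between versions;
# availability on Ascend-vLLM is assumed here, not guaranteed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    enable_prefix_caching=True,        # auto-prefix-caching: reuse cached prompt prefixes
    enable_chunked_prefill=True,       # chunked-prefill (SplitFuse): mix prefill and decode
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Explain paged attention briefly."], params)[0].outputs[0].text)
```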