Introduction to Ascend-vLLM
Overview
vLLM is a well-known GPU-based framework for foundation model inference. Its popularity stems from features such as continuous batching and PagedAttention. Additionally, vLLM supports speculative decoding and automatic prefix caching, making it valuable for both academic and industrial applications.
Ascend-vLLM is an inference framework optimized for NPUs. It inherits the advantages of vLLM while boosting performance and usability through NPU-specific optimizations, making it more efficient and convenient to run foundation models on NPUs. Ascend-vLLM can be applied to a wide range of foundation model inference tasks, particularly in scenarios requiring high performance and efficiency, such as natural language processing and multimodal understanding.
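Because Ascend-vLLM inherits the vLLM programming interface, a minimal offline-inference sketch using the standard open-source vLLM Python API illustrates the workflow. The model name, prompts, and sampling settings below are placeholders, and availability of this exact API on a given Ascend-vLLM build is an assumption rather than documented behavior.

```python
# Minimal offline-inference sketch using the standard vLLM Python API.
# Model name and sampling settings are placeholders, not recommendations.
from vllm import LLM, SamplingParams

prompts = [
    "What is foundation model inference?",
    "Summarize the benefits of continuous batching.",
]

# Sampling configuration; adjust temperature/top_p/max_tokens as needed.
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# Loads the model and allocates the paged KV cache.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Continuous batching schedules all prompts together for higher throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```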
Highlights
- Ease of use: Ascend-vLLM makes deploying and running foundation models simpler for developers.
- Easy development: An intuitive interface simplifies model development and debugging, so adjustments and optimizations are straightforward.
- High performance: NPU-specific features and optimizations, such as PD aggregation and optimized preprocessing, postprocessing, and sampling, boost inference efficiency.
Supported Features
| Category | Feature | Description |
|---|---|---|
| Scheduling | Page-attention | Manages the KV cache in blocks to improve throughput. |
| Scheduling | Continuous batching | Iterative scheduling and dynamic batch adjustment reduce latency and improve throughput. |
| Quantization | W4A16-AWQ | INT4 weight quantization reduces memory usage and speeds up processing, boosting performance for low-concurrency tasks by 80% with accuracy loss under 2%. |
| Quantization | W8A8-SmoothQuant | INT8 quantization of weights and activations reduces memory usage, boosts throughput by 30%, and keeps precision loss under 1.5%. |
| Efficient decoding | Auto-prefix-caching | Caching the prompt prefix speeds up first-token generation, which helps significantly with lengthy system prompts or multi-turn conversations. |
| Efficient decoding | Chunked-prefill | Also called SplitFuse; it improves resource efficiency and performance by combining full (prefill) and incremental (decode) inference. |
| Efficient decoding | Speculative Decoding | Boosts inference speed by supporting both large-and-small-model speculation and eager-mode speculative execution. |
| Graph mode | ascend-turbo-graph | Traces operator dependencies during execution, removes Python host latency, and handles dynamic shapes. If no setting is provided, acl_graph is used by default; if unsupported, it falls back to eager mode. |
| Graph mode | acl-graph | Delivers performance comparable to cuda-graph's piece-wise graph. When eager mode is enabled, it takes precedence. |
| Output control | Guided Decoding | Constrains model outputs to a specified format. |
| Output control | Beam search | Outputs multiple candidate results through beam search. |
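As a hedged illustration, the sketch below shows how two of the features above, auto-prefix-caching and chunked-prefill, are typically switched on through vLLM-style engine arguments. The argument names follow the open-source vLLM API and may vary across versions; the model name is a placeholder, and support for these exact arguments in Ascend-vLLM is an assumption.

```python
# Sketch: enabling auto-prefix-caching and chunked-prefill via vLLM engine arguments.
# Argument names follow the open-source vLLM API and may differ between versions;
# availability on Ascend-vLLM is assumed here, not guaranteed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
    enable_prefix_caching=True,        # auto-prefix-caching: reuse cached prompt prefixes
    enable_chunked_prefill=True,       # chunked-prefill (SplitFuse): mix prefill and decode
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Explain paged attention briefly."], params)[0].outputs[0].text)
```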