Updated on 2025-11-04 GMT+08:00

Introduction to Ascend-vLLM

Overview

vLLM is a well-known GPU-based framework for foundation model inference. Its popularity stems from features such as continuous batching and PagedAttention. Additionally, vLLM supports speculative inference and automatic prefix caching, making it valuable for both academic and industrial applications.

Ascend-vLLM is an inference framework optimized for NPUs. It inherits the advantages of vLLM and further improves performance and usability through NPU-specific optimizations, making it more efficient and convenient to run foundation models on NPUs. Ascend-vLLM can be applied across a wide range of foundation model inference tasks, particularly in scenarios that demand high performance and efficiency, such as natural language processing and multimodal understanding.
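
Because Ascend-vLLM inherits vLLM's interfaces, offline inference typically looks like standard vLLM usage. The following is a minimal sketch assuming Ascend-vLLM exposes the upstream vLLM Python API; the model path is a placeholder.

```python
# Minimal offline inference sketch. Assumes Ascend-vLLM exposes the standard
# vLLM Python API; the model path below is a placeholder for a locally
# available checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model")  # placeholder model path
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)
```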

Highlights

  1. Ease of use: Ascend-vLLM makes it simpler for developers to deploy and run foundation models (see the sketch after this list).
  2. Easy development: An intuitive interface makes models easier to develop and debug, so adjustments and optimizations are straightforward.
  3. High performance: Huawei's NPU-specific features and optimizations, such as PD aggregation and optimized preprocessing, postprocessing, and sampling, boost inference efficiency.
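
As an illustration of the ease-of-use highlight, the sketch below queries an Ascend-vLLM service through the OpenAI-compatible API provided by upstream vLLM. It assumes a server has already been started; the endpoint URL and model name are placeholders.

```python
# Query an already running OpenAI-compatible endpoint (as served by vLLM's
# OpenAI entrypoint). The base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/model",
    messages=[{"role": "user", "content": "What does paged attention do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```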

Supported Features

Table 1 Features supported by Ascend-vLLM

| Category | Feature | Description |
|---|---|---|
| Scheduling | Paged-attention | Manages the KV cache in blocks to improve throughput. |
| Scheduling | Continuous batching | Iterative scheduling and dynamic batch adjustment reduce latency and improve throughput. |
| Quantization | W4A16-AWQ | INT4 weight quantization reduces device memory usage and accelerates inference, boosting performance for low-concurrency tasks by 80% with accuracy loss under 2%. |
| Quantization | W8A8-SmoothQuant | INT8 quantization reduces device memory usage and improves throughput by 30%, with precision loss under 1.5%. |
| Efficient decoding | Auto-prefix-caching | Caches prompt prefixes to speed up first-token generation, which helps significantly with long system prompts and multi-turn conversations (see the configuration sketch after this table). |
| Efficient decoding | Chunked-prefill | Also known as SplitFuse; it combines full (prefill) and incremental (decode) inference to improve resource utilization and performance (see the configuration sketch after this table). |
| Efficient decoding | Speculative decoding | Accelerates inference by supporting speculation with a small draft model assisting the large model, as well as eager-mode speculative execution. |
| Graph mode | ascend-turbo-graph | Captures operator dependencies at execution time, removes Python host overhead, and supports dynamic shapes. If no setting is provided, it uses acl-graph by default; if that is unsupported, it falls back to eager mode. |
| Graph mode | acl-graph | Delivers performance comparable to cuda-graph's piece-wise graph. It takes precedence when eager mode is enabled. |
| Output control | Guided decoding | Constrains model outputs to a specified format or schema (see the guided decoding example after this table). |
| Output control | Beam search | Returns multiple candidate results through beam search. |
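
The features above are typically switched on through engine arguments. The sketch below assumes Ascend-vLLM accepts the same arguments as upstream vLLM (quantization, enable_prefix_caching, enable_chunked_prefill); the AWQ checkpoint path is a placeholder.

```python
# Enable W4A16-AWQ quantization, auto-prefix-caching, and chunked prefill,
# assuming the upstream vLLM engine arguments are supported. The model path
# is a placeholder for an AWQ-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/awq-quantized-model",  # placeholder AWQ checkpoint
    quantization="awq",                    # W4A16-AWQ weights
    enable_prefix_caching=True,            # auto-prefix-caching
    enable_chunked_prefill=True,           # chunked-prefill (SplitFuse)
)

params = SamplingParams(max_tokens=64)
print(llm.generate(["Summarize paged attention."], params)[0].outputs[0].text)
```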
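
For guided decoding, upstream vLLM's OpenAI-compatible server accepts extra parameters such as guided_json; the sketch below assumes Ascend-vLLM keeps this behavior. The endpoint, model name, and schema are illustrative only.

```python
# Constrain the output to a JSON schema via the guided_json extra parameter
# of vLLM's OpenAI-compatible server. Endpoint, model, and schema are
# placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

response = client.chat.completions.create(
    model="/path/to/model",
    messages=[{"role": "user", "content": "Name one open-source LLM and its release year."}],
    extra_body={"guided_json": schema},  # guided decoding: force schema-conforming JSON
)
print(response.choices[0].message.content)
```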