
Prefix Caching

What Is Prefix Caching?

In LLM inference applications, long system prompts and multi-turn dialogues are common. With a long system prompt, the prompt is identical across requests, so its KV Cache computation is identical as well. In a multi-turn dialogue, each round depends on the context of all previous rounds, so the KV Cache of the historical rounds would otherwise be recomputed in every subsequent round. In both cases, if the KV Cache of the system prompt and of the historical rounds can be saved and reused by subsequent requests, the time to first token (TTFT) is significantly reduced. If both the prefix KV Cache and the generated KV Cache are cached, then in multi-turn dialogue applications, ignoring edge cases, recomputation of the dialogue generated in historical rounds is essentially eliminated.
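
As a minimal offline sketch (assuming the vLLM Python API and the facebook/opt-125m model used later in Table 2; the dialogue text is illustrative), two rounds of a conversation share a growing common prefix, so the second round's prefill can reuse the KV Cache computed in the first round:

    from vllm import LLM, SamplingParams

    # Offline engine with prefix caching enabled (the default in current versions).
    llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
    params = SamplingParams(temperature=0.7, max_tokens=64)

    system_prompt = "You are a helpful assistant.\n"

    # Round 1: system prompt plus the first user question.
    round1_prompt = system_prompt + "User: What is prefix caching?\nAssistant:"
    round1_reply = llm.generate([round1_prompt], params)[0].outputs[0].text

    # Round 2: the entire round-1 history is the common prefix, so its KV Cache
    # can be reused instead of being recomputed during prefill.
    round2_prompt = round1_prompt + round1_reply + "\nUser: Why does that reduce TTFT?\nAssistant:"
    round2_reply = llm.generate([round2_prompt], params)[0].outputs[0].text
    print(round2_reply)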

Ascend-vLLM provides the key feature of prefix caching, which can significantly reduce TTFT in scenarios with long system prompts and multi-turn dialogues, enhancing user experience. Its advantages mainly include:

  • Shorter prefill time: The KV cache of token sequences that repeat across requests is reused, so the KV cache of those prefix tokens does not have to be recomputed, which shortens the prefill time.
  • More efficient memory usage: When in-flight requests share a common prefix, the KV cache of that prefix is stored once and shared, instead of occupying a separate copy of memory for each request.

Constraints

  • This feature cannot be used together with ascend_scheduler_config.
  • Chunked prefill is enabled by default and can be disabled only when ascend_scheduler_config is in effect. Because prefix caching cannot be used with ascend_scheduler_config, chunked prefill is always in effect whenever prefix caching is.
  • Multimodal models do not support prefix caching.
  • The Qwen2.5 and Qwen3 models support this feature.
  • The KV cache of a common prefix is reused only when the number of prefix tokens shared across requests is greater than or equal to the PagedAttention block size (see the sketch after this list).
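
The sketch below illustrates the last constraint. It is a minimal sketch assuming the vLLM Python API; block_size is a standard engine argument, and the value 128 is only an assumed example.

    from vllm import LLM

    # With a PagedAttention block size of 128 (illustrative value), the KV Cache
    # of a cross-request common prefix is reused only if that prefix spans at
    # least 128 tokens; shorter shared prefixes are recomputed per request.
    llm = LLM(
        model="facebook/opt-125m",
        enable_prefix_caching=True,
        block_size=128,
    )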

Prefix Caching Parameter Settings

Table 1 describes the supplementary parameters for prefix caching that can be set when the inference service is started. Table 2 shows code examples.

Table 1 Prefix caching parameters

| Service Startup Method | Configuration Item | Type | Range | Description |
| --- | --- | --- | --- | --- |
| offline | enable_prefix_caching | bool | True, False | True: enables prefix caching (the default if this parameter is not configured). False: disables prefix caching. |
| online | --no-enable-prefix-caching | - | - | The current version enables prefix caching by default; this configuration item disables it. |

Note:

Prefix caching is configured when the service is started. The online option --no-enable-prefix-caching is an action-type parameter: it takes no value, and its presence disables prefix caching.

Table 2 Code examples for enabling prefix caching

| Service Startup Method | API | Service Startup Base Command |
| --- | --- | --- |
| offline | - | LLM(model="facebook/opt-125m", enable_prefix_caching=True) |
| online (no configuration needed; enabled by default) | openai | python -m vllm.entrypoints.openai.api_server --model=facebook/opt-125m |
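
As a hedged usage sketch for the online case (the endpoint http://localhost:8000 and the request fields below assume the api_server defaults), prefix caching needs no per-request configuration: requests that share a system prompt automatically reuse its KV Cache.

    import requests

    URL = "http://localhost:8000/v1/completions"  # assumed default endpoint
    SYSTEM_PROMPT = "You are a helpful assistant.\n"

    for question in ["What is prefix caching?", "How does it reduce TTFT?"]:
        resp = requests.post(URL, json={
            "model": "facebook/opt-125m",
            "prompt": SYSTEM_PROMPT + "User: " + question + "\nAssistant:",
            "max_tokens": 64,
        })
        # The shared SYSTEM_PROMPT prefix is cached after the first request, so
        # the second request skips recomputing its KV Cache during prefill.
        print(resp.json()["choices"][0]["text"])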

Inference Execution Reference

  1. Configure the service parameters. To use this feature in Ascend-vLLM, see Table 1 and Table 2. For details about other parameters, see Starting an LLM-powered Inference Service.
  2. Start the service. For details, see Starting an LLM-powered Inference Service.
  3. Evaluate the accuracy and performance. For details, see Inference Service Accuracy Evaluation and Inference Service Performance Evaluation.
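
As a rough illustration of the TTFT effect (not a substitute for the evaluation guides referenced above; the endpoint and payload are assumptions matching the online example), streaming the same prompt twice and timing the first chunk shows the benefit of the cached prefix on the second run:

    import time
    import requests

    URL = "http://localhost:8000/v1/completions"  # assumed default endpoint
    PAYLOAD = {
        "model": "facebook/opt-125m",
        "prompt": "You are a helpful assistant.\nUser: What is prefix caching?\nAssistant:",
        "max_tokens": 32,
        "stream": True,
    }

    def time_to_first_token():
        start = time.time()
        with requests.post(URL, json=PAYLOAD, stream=True) as resp:
            for line in resp.iter_lines():
                if line:  # first streamed chunk roughly marks the first token
                    return time.time() - start

    # The second call should report a lower TTFT: the prompt's KV Cache was
    # filled by the first call and is reused during the second prefill.
    print("TTFT (cold):  ", time_to_first_token())
    print("TTFT (cached):", time_to_first_token())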