Prefix Caching
What Is Prefix Caching?
In LLM inference applications, long system prompts and multi-turn dialogues are common scenarios. With a long system prompt, the system prompt is identical across requests, so the KV cache computed for it is also identical. In a multi-turn dialogue, each round depends on the context of all previous rounds, and the KV cache for those historical rounds would otherwise have to be recomputed in every subsequent round. In both cases, saving the KV cache of the system prompt and of the historical rounds and reusing it for subsequent requests significantly reduces the time to first token (TTFT). If both the prefix KV cache and the generated KV cache are cached, then in multi-turn dialogue applications, ignoring edge cases, recomputation of the dialogue generated in historical rounds is essentially eliminated.
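The following is a minimal sketch, not part of the official documentation, of why multi-turn dialogue benefits from prefix caching: the full prompt of each round is a strict prefix of the next round's prompt, so its KV cache can be reused instead of recomputed. The dialogue content and prompt format are illustrative assumptions.

```python
# Sketch: in multi-turn dialogue, the prompt of round N is a prefix of round N+1,
# so the KV cache already computed for it can be reused. (Dialogue content and
# prompt format are illustrative assumptions.)
system_prompt = "You are a helpful assistant.\n"
history = system_prompt

turns = [
    ("What is prefix caching?", "It reuses the KV cache of a shared prompt prefix."),
    ("Why does it reduce TTFT?", "The shared prefix does not need to be prefilled again."),
]

for user_msg, assistant_msg in turns:
    request_prompt = history + f"User: {user_msg}\nAssistant:"
    # Everything in `history` was already seen in the previous round, so its
    # KV cache can be served from the prefix cache rather than recomputed.
    print(f"Prompt length this round: {len(request_prompt)} chars, "
          f"reusable prefix: {len(history)} chars")
    history = request_prompt + f" {assistant_msg}\n"
```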
Ascend-vLLM provides prefix caching as a key feature that can significantly reduce TTFT in scenarios with long system prompts and multi-turn dialogues, improving the user experience. Its main advantages include the following (a usage sketch follows the list):
- Shorter prefill time: The KV cache for token sequences repeated across requests can be reused, which skips the KV cache computation for those shared prefix tokens and thereby reduces the prefill time.
- More efficient memory usage: When the requests being processed share a common prefix, the KV cache of that prefix is stored once and shared, instead of occupying multiple copies of memory.
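The sketch below shows cross-request sharing of a long system prompt with the offline LLM API. The model path matches the example in Table 2; the system prompt, user questions, and sampling settings are illustrative assumptions.

```python
# Minimal offline sketch (system prompt, questions, and sampling settings are
# illustrative assumptions). Two requests share the same long system prompt;
# with prefix caching enabled, the KV cache computed for that shared prefix by
# the first request can be reused by the second, so only the request-specific
# suffix needs to be prefilled.
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = (
    "You are a customer-service assistant. Answer politely and concisely. "
    "Always follow the company policy described below. ..."  # long shared prefix
)

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [
    SYSTEM_PROMPT + "\nUser: How do I reset my password?\nAssistant:",
    SYSTEM_PROMPT + "\nUser: How do I change my shipping address?\nAssistant:",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```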
Constraints
- This feature cannot be used together with ascend_scheduler_config.
- Chunked prefill is enabled by default and can be disabled only when ascend_scheduler_config takes effect. Because prefix caching cannot be used together with ascend_scheduler_config, chunked prefill is always active while prefix caching is in effect.
- Multimodal models do not support prefix caching.
- The Qwen2.5 and Qwen3 models support this feature.
- The KV cache of common prefix tokens is reused only when the number of prefix tokens shared across requests is greater than or equal to the block size used by PagedAttention (see the sketch after this list).
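A small sketch of this granularity constraint follows. The block size of 16 is an assumed example; the actual value depends on the PagedAttention configuration of the deployment.

```python
# Sketch of block-granular prefix reuse (BLOCK_SIZE = 16 is an assumed example;
# the real value depends on the PagedAttention configuration).
BLOCK_SIZE = 16

def reusable_prefix_tokens(shared_prefix_len: int, block_size: int = BLOCK_SIZE) -> int:
    """KV cache is reused only in whole blocks, so the shared prefix is
    rounded down to a multiple of the block size."""
    return (shared_prefix_len // block_size) * block_size

print(reusable_prefix_tokens(10))  # 0  -> prefix shorter than one block, no reuse
print(reusable_prefix_tokens(16))  # 16 -> exactly one block reused
print(reusable_prefix_tokens(50))  # 48 -> three full blocks reused, 2 tokens recomputed
```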
Prefix Caching Parameter Settings
Table 1 describes the supplementary parameters to be set for prefix caching when the inference service is started. Table 2 shows the code examples, followed by a client-side query sketch.
Table 1 Prefix caching parameters

| Service Startup Method | Configuration Item | Type | Range | Description |
|---|---|---|---|---|
| offline | enable_prefix_caching | bool | - | - |
| online | --no-enable-prefix-caching | - | - | Note: Enabling prefix caching is specified when starting the service and is an action type parameter. |
Table 2 Code examples for enabling prefix caching

| Service Startup Method | API | Service Startup Base Command |
|---|---|---|
| offline | - | `LLM(model="facebook/opt-125m", enable_prefix_caching=True)` |
| online (no configuration needed; it is enabled by default) | openai | `python -m vllm.entrypoints.openai.api_server --model=facebook/opt-125m` |
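As a complement to Table 2, the sketch below queries the online service through its OpenAI-compatible endpoint. The base URL assumes the server's default address and port (http://localhost:8000), and the prompts are illustrative; adjust them to your deployment.

```python
# Sketch of querying the online service started per Table 2 (assumptions: the
# server listens on the default http://localhost:8000, and the prompts are
# illustrative). Requests that repeat the same long prefix benefit from prefix
# caching on the server side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_prefix = "Background document: ... (a long, identical context) ...\n"

for question in ("Summarize the document.", "List its key terms."):
    completion = client.completions.create(
        model="facebook/opt-125m",
        prompt=shared_prefix + question,
        max_tokens=64,
        temperature=0.0,
    )
    print(completion.choices[0].text)
```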
Inference Execution Reference
- Configure the service parameters. To use this feature in Ascend-vLLM, see Table 1 and Table 2. For details about other parameters, see Starting an LLM-powered Inference Service.
- Start the service. For details, see Starting an LLM-powered Inference Service.
- Evaluate the accuracy and performance. For details, see Inference Service Accuracy Evaluation and Inference Service Performance Evaluation.