Prefix Caching
What Is Prefix Caching?
In LLM inference applications, long system prompts and multi-turn dialogues are common scenarios. With a long system prompt, the system prompt is identical across requests, so the KV cache computed for it is also identical. In a multi-turn dialogue, each round depends on the context of all previous rounds, and the KV cache for those historical rounds would otherwise have to be recomputed in every subsequent round. In both cases, saving the KV cache of the system prompt and of the historical rounds and reusing it for subsequent requests significantly reduces the time to first token (TTFT). If both the prefix KV cache and the generated KV cache are cached, then in multi-turn dialogue applications, ignoring edge cases, recomputation of the dialogue generated in historical rounds is essentially eliminated.
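The following is a minimal sketch, not part of the official documentation, of why multi-turn dialogue benefits from prefix caching: the full prompt of each round is a strict prefix of the next round's prompt, so its KV cache can be reused instead of recomputed. The dialogue content and prompt format are illustrative assumptions.

```python
# Sketch: in multi-turn dialogue, the prompt of round N is a prefix of round N+1,
# so the KV cache already computed for it can be reused. (Dialogue content and
# prompt format are illustrative assumptions.)
system_prompt = "You are a helpful assistant.\n"
history = system_prompt

turns = [
    ("What is prefix caching?", "It reuses the KV cache of a shared prompt prefix."),
    ("Why does it reduce TTFT?", "The shared prefix does not need to be prefilled again."),
]

for user_msg, assistant_msg in turns:
    request_prompt = history + f"User: {user_msg}\nAssistant:"
    # Everything in `history` was already seen in the previous round, so its
    # KV cache can be served from the prefix cache rather than recomputed.
    print(f"Prompt length this round: {len(request_prompt)} chars, "
          f"reusable prefix: {len(history)} chars")
    history = request_prompt + f" {assistant_msg}\n"
```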
Ascend-vLLM provides prefix caching as a key feature that can significantly reduce TTFT in scenarios with long system prompts and multi-turn dialogues, improving the user experience. Its main advantages include the following (a usage sketch follows the list):
- Shorter prefill time: The KV cache for token sequences repeated across requests can be reused, which skips the KV cache computation for those shared prefix tokens and thereby reduces the prefill time.
- More efficient memory usage: When the requests being processed share a common prefix, the KV cache of that prefix is stored once and shared, instead of occupying multiple copies of memory.
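The sketch below shows cross-request sharing of a long system prompt with the offline LLM API. The model path matches the example in Table 2; the system prompt, user questions, and sampling settings are illustrative assumptions.

```python
# Minimal offline sketch (system prompt, questions, and sampling settings are
# illustrative assumptions). Two requests share the same long system prompt;
# with prefix caching enabled, the KV cache computed for that shared prefix by
# the first request can be reused by the second, so only the request-specific
# suffix needs to be prefilled.
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = (
    "You are a customer-service assistant. Answer politely and concisely. "
    "Always follow the company policy described below. ..."  # long shared prefix
)

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

prompts = [
    SYSTEM_PROMPT + "\nUser: How do I reset my password?\nAssistant:",
    SYSTEM_PROMPT + "\nUser: How do I change my shipping address?\nAssistant:",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```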
Constraints
- This feature cannot be used together with ascend_scheduler_config.
- Chunked prefill is enabled by default and can be disabled only when ascend_scheduler_config takes effect. Because prefix caching cannot be used together with ascend_scheduler_config, chunked prefill is always active while prefix caching is in effect.
- Multimodal models do not support prefix caching.
- The Qwen2.5 and Qwen3 models support this feature.
- The KV cache of common prefix tokens is reused only when the number of prefix tokens shared across requests is greater than or equal to the block size used by PagedAttention (see the sketch after this list).
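A small sketch of this granularity constraint follows. The block size of 16 is an assumed example; the actual value depends on the PagedAttention configuration of the deployment.

```python
# Sketch of block-granular prefix reuse (BLOCK_SIZE = 16 is an assumed example;
# the real value depends on the PagedAttention configuration).
BLOCK_SIZE = 16

def reusable_prefix_tokens(shared_prefix_len: int, block_size: int = BLOCK_SIZE) -> int:
    """KV cache is reused only in whole blocks, so the shared prefix is
    rounded down to a multiple of the block size."""
    return (shared_prefix_len // block_size) * block_size

print(reusable_prefix_tokens(10))  # 0  -> prefix shorter than one block, no reuse
print(reusable_prefix_tokens(16))  # 16 -> exactly one block reused
print(reusable_prefix_tokens(50))  # 48 -> three full blocks reused, 2 tokens recomputed
```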
Prefix Caching Parameter Settings
Table 1 describes the supplementary parameters to be set for prefix caching when the inference service is started. Table 2 shows the code examples, followed by a client-side query sketch.
Table 1 Prefix caching parameters

| Service Startup Method | Configuration Item | Type | Range | Description |
|---|---|---|---|---|
| offline | enable_prefix_caching | bool | - | - |
| online | --no-enable-prefix-caching | - | - | Note: Enabling prefix caching is specified when starting the service and is an action type parameter. |
Table 2 Code examples for enabling prefix caching

| Service Startup Method | API | Service Startup Base Command |
|---|---|---|
| offline | - | `LLM(model="facebook/opt-125m", enable_prefix_caching=True)` |
| online (no configuration needed; it is enabled by default) | openai | `python -m vllm.entrypoints.openai.api_server --model=facebook/opt-125m` |
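As a complement to Table 2, the sketch below queries the online service through its OpenAI-compatible endpoint. The base URL assumes the server's default address and port (http://localhost:8000), and the prompts are illustrative; adjust them to your deployment.

```python
# Sketch of querying the online service started per Table 2 (assumptions: the
# server listens on the default http://localhost:8000, and the prompts are
# illustrative). Requests that repeat the same long prefix benefit from prefix
# caching on the server side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

shared_prefix = "Background document: ... (a long, identical context) ...\n"

for question in ("Summarize the document.", "List its key terms."):
    completion = client.completions.create(
        model="facebook/opt-125m",
        prompt=shared_prefix + question,
        max_tokens=64,
        temperature=0.0,
    )
    print(completion.choices[0].text)
```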
Inference Execution Reference
- Configure the service parameters. To use this feature in Ascend-vLLM, see Table 1 and Table 2. For details about other parameters, see Starting an LLM-powered Inference Service.
- Start the service. For details, see Starting an LLM-powered Inference Service.
- Evaluate the accuracy and performance. For details, see Inference Service Accuracy Evaluation and Inference Service Performance Evaluation.