Chunked Prefill
What Is Chunked Prefill?
Chunked Prefill (Splitfuse) is a feature that breaks long prompt requests into smaller chunks and schedules them across multiple forward steps. Generation for the prompt request begins only after the forward pass of the last chunk completes. Short prompt requests are batched into the remaining token budget of each step, so the computational load of each step stays roughly equal, which stabilizes the average latency across all requests.
Key operations:
- Long prompts are divided into smaller chunks and scheduled over multiple iterations. Output token generation only occurs after the last iteration.
- During batch construction, a prefill chunk is combined with decode requests that fill the remaining slots, improving hardware utilization compared with decode-only batches (see the sketch after this list).
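To make the scheduling idea concrete, here is a minimal Python sketch of one possible step-filling policy. It is not the vLLM implementation; all names (Request, schedule_step) and the one-chunk-per-step policy are illustrative assumptions.

```python
# Illustrative sketch of chunked-prefill step scheduling (not vLLM code).
# All names and the scheduling policy are hypothetical.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int       # total prompt tokens to prefill
    prefilled: int = 0    # prompt tokens already processed

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.prefilled

def schedule_step(prefill: Request, decodes: list[Request], budget: int) -> dict:
    """Fill one forward step: one prefill chunk plus decode tokens."""
    chunk = min(prefill.remaining, budget)   # cap the chunk at the step's token budget
    prefill.prefilled += chunk
    # Each decode request contributes exactly one token per step; use the
    # leftover budget to batch as many decode requests as possible.
    n_decodes = min(len(decodes), budget - chunk)
    return {"prefill_chunk": chunk,
            "decode_tokens": n_decodes,
            "first_token_ready": prefill.remaining == 0}

# Example: a 10,000-token prompt with a 4,096-token step budget.
req = Request(prompt_len=10_000)
steps = []
while req.remaining > 0:
    steps.append(schedule_step(req, decodes=[Request(1, 1)] * 500, budget=4_096))
print(len(steps))                      # 3 steps: 4096 + 4096 + 1808 tokens
print(steps[-1]["first_token_ready"])  # True: generation starts after the last chunk
```

Note how the final, smaller chunk leaves budget for decode requests to ride along in the same step; that leftover capacity is the utilization gain described above.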
Its main advantages include:
- Improved efficiency: By efficiently combining short and long prompts, the model maintains high throughput.
- Enhanced consistency: Standardizing the forward pass size reduces latency fluctuations and stabilizes the generation frequency.
- Reduced latency: By balancing the compute utilization of prefill and decode, it reduces P90 TTFT (time to first token) and P90 TPOT (time per output token), particularly in scenarios with short inputs, short outputs, and high concurrency.
Constraints
- This feature cannot be used together with automatic prefix caching (APC) or KV cache quantization.
- The Qwen series of models supports this feature.
- This feature is enabled by default in the vLLM v1 scheduler. When the vllm_ascend scheduler is enabled, additional configuration is required to keep chunked prefill enabled; otherwise it is disabled by default. See the parameters table and configuration cases below (in cases 1 and 3, chunked_prefill is enabled; in case 2, it is disabled).
Chunked Prefill Parameters
The parameters for chunked prefill are listed in the table below.
| Configuration Item | Type | Range | Description |
|---|---|---|---|
| enable-chunked-prefill | bool | - | Controls whether chunked prefill is enabled. The effective default depends on the scheduler in use; see the configuration cases below this table. |
| max-num-batched-tokens | int | ≥ 256 and a multiple of 256 | In chunked prefill mode, limits the maximum chunk (split) length and must be greater than or equal to --max-model-len. Recommended values are 4096, 8192, or even larger. |

Configuration cases for enable-chunked-prefill (a Python usage sketch follows the cases):

1. The chunked_prefill feature is enabled by default in the open-source vLLM v1 scheduler on GPUs or NPUs.
2. When using vllm_ascend with ascend_scheduler_config set to enabled = true, chunked_prefill is disabled by default:

   ```
   --additional-config='{"ascend_scheduler_config": {"enabled": true}}'
   ```

3. When using vllm_ascend, to enable chunked_prefill you must also set chunked_prefill_enabled to true under ascend_scheduler_config; otherwise it remains disabled, as in case 2:

   ```
   --additional-config='{"ascend_scheduler_config": {"enabled": true, "chunked_prefill_enabled": true}}'
   ```
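For an end-to-end illustration, the following is a minimal Python sketch of passing these settings to vLLM's offline LLM entrypoint. The argument names (enable_chunked_prefill, max_num_batched_tokens, additional_config) follow the open-source vLLM engine arguments, and the model name is only a placeholder; verify both against your Ascend-vLLM build.

```python
# Hedged example: offline inference with chunked prefill enabled.
# Argument names follow open-source vLLM engine args; verify them
# against your Ascend-vLLM version before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder: any supported Qwen model
    enable_chunked_prefill=True,        # cases 1/3: keep chunked prefill on
    max_num_batched_tokens=8192,        # max chunk size per step (multiple of 256)
    additional_config={                 # case 3: vllm_ascend scheduler settings
        "ascend_scheduler_config": {
            "enabled": True,
            "chunked_prefill_enabled": True,
        }
    },
)
outputs = llm.generate(["A long prompt ..."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

The same settings map directly onto the server CLI flags shown in the cases above when starting an online inference service instead.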
Inference Execution Reference
- To use chunked prefill in Ascend-vLLM, configure the parameters in the table above. For details about other parameters and how to start the inference service, see Starting an LLM-powered Inference Service.