
Usage of Speculative Inference

What Is Speculative Inference?

Traditional LLM inference relies mainly on auto-regressive decoding, where each decoding step produces only one output token and the previously generated tokens must be concatenated and fed back into the LLM before the next step can run, so the expensive LLM is invoked once for every single token. To address this inefficiency, speculative inference has been proposed. The core idea is to let a small model, whose computational cost is far lower than that of the LLM, perform the speculation: the small model speculatively infers multiple steps at a time, and the LLM then validates all of these draft results in a single step.

In speculative mode, the small model first infers tokens 1, 2, and 3 sequentially. These 3 draft tokens are then input into the LLM for a single validation pass, which produces tokens 1', 2', 3', and 4'. The draft tokens 1, 2, and 3 are compared with 1', 2', and 3' respectively: drafts are accepted from left to right as long as they match, and the first mismatching position falls back to the LLM's own token. One cycle therefore yields 1 to 4 tokens from three speculative inferences of the small model (which are much faster than the large model) plus one inference of the large model, significantly improving inference performance.
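The accept/verify logic can be summarized in a few lines. Below is a minimal, framework-agnostic sketch of one draft-and-verify cycle; draft_next_token and target_logits are hypothetical stand-ins for the small-model and large-model calls, and only greedy (exact-match) acceptance is shown, whereas production implementations typically use a probabilistic acceptance rule when sampling.

    # Minimal sketch of one draft-and-verify cycle (hypothetical draft_next_token /
    # target_logits helpers stand in for real model calls; greedy acceptance only).
    def speculative_step(prompt_ids, draft_next_token, target_logits, k=3):
        # 1) The small model drafts k tokens auto-regressively (cheap).
        draft = []
        ctx = list(prompt_ids)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) The large model scores prompt + draft in a single forward pass,
        #    producing its own prediction at every draft position (tokens 1'..(k+1)').
        logits = target_logits(prompt_ids + draft)          # one row of logits per position
        target_pred = [int(row.argmax()) for row in logits[len(prompt_ids) - 1:]]

        # 3) Accept drafted tokens left to right while they match the large model,
        #    then append the large model's own token at the first mismatch (or the
        #    bonus token (k+1)' if everything matched). Yields 1..k+1 tokens per cycle.
        accepted = []
        for i, t in enumerate(draft):
            if t == target_pred[i]:
                accepted.append(t)
            else:
                accepted.append(target_pred[i])
                break
        else:
            accepted.append(target_pred[k])
        return accepted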

In this way, speculative inference brings the following advantage:

Shorter average decode time: Taking Qwen2-72B as the large model and Qwen2-0.5B as the small model, one inference step of the small model takes less than 1/5 of the time of a large-model step. Including the validation step, a complete speculative inference cycle takes only about 1.5 times as long as a single large-model step (with the speculative step count set to 3), and on average such a cycle produces 3 valid tokens. Generating 3 times the tokens at 1.5 times the time cost doubles decode throughput, i.e., roughly a 100% performance improvement, as the sketch below illustrates.
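The arithmetic behind that claim can be checked with a quick estimate. The sketch below is illustrative only; the draft cost ratio, speculative step count, and average number of accepted tokens are assumptions taken from the example above, not measured values.

    # Back-of-the-envelope throughput estimate for speculative decoding.
    def speculative_speedup(draft_cost_ratio=0.2, draft_steps=3, accepted_tokens=3.0):
        # One cycle = draft_steps cheap small-model steps + 1 large-model validation step,
        # measured relative to the cost of a single large-model decode step.
        cycle_cost = draft_steps * draft_cost_ratio + 1.0   # 3 * 0.2 + 1 = 1.6, i.e. ~1.5x
        # Plain auto-regressive decoding produces 1 token per unit of large-model cost.
        return accepted_tokens / cycle_cost

    print(speculative_speedup())   # ~1.9x, in line with the ~100% improvement cited above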

Speculative Inference Parameter Settings

When starting the offline or online inference service, set parameters by referring to Table 1 to use speculative inference.

Table 1 Speculative inference parameters

Configuration Item   | Parameter               | Type | Description
---------------------|-------------------------|------|------------------------------------------------------------
--speculative-config | num_speculative_tokens  | int  | The number of tokens to predict each time; must be greater than or equal to 1. It is recommended to start with 1 and then adjust based on the acceptance rate after confirming the benefit.
--speculative-config | method                  | str  | Speculative method; currently only ngram is supported.
--speculative-config | prompt_lookup_min       | int  | Minimum match length; effective only when method is set to ngram.
--speculative-config | prompt_lookup_max       | int  | Maximum match length; effective only when method is set to ngram.
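For offline inference, the same parameters can be passed programmatically. The snippet below is a minimal sketch assuming a vLLM/Ascend-vLLM build whose LLM class accepts speculative_config as a dict mirroring the --speculative-config command-line flag; the model path is a placeholder and should be adjusted to your environment.

    # Offline counterpart of the CLI flags in Table 1 (sketch; verify against your vLLM version).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/path/to/base_model",          # placeholder path
        tensor_parallel_size=4,
        max_model_len=8192,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 1,
            "prompt_lookup_min": 1,
            "prompt_lookup_max": 8,
        },
    )

    outputs = llm.generate(
        ["Who is the current president of the United States?"],
        SamplingParams(temperature=0, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)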

Example of E2E Speculative Inference

The following example uses the Qwen3-32B model as the large model and accesses it through the OpenAI-compatible API service.

  1. Start the inference service:
    base_model=/path/to/base_model
    export VLLM_PLUGINS=ascend_vllm
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    python -m vllm.entrypoints.openai.api_server --model=${base_model} \
    --max-num-seqs=256 \
    --max-model-len=8192 \
    --max-num-batched-tokens=8192 \
    --dtype=bfloat16 \
    --tensor-parallel-size=4 \
    --host=0.0.0.0 \
    --port=18080 \
    --gpu-memory-utilization=0.8 \
    --trust-remote-code \
    --additional-config='{"ascend_turbo_graph_config": {"enabled": true}}' \
    --speculative-config '{"num_speculative_tokens":1,"method":"ngram","prompt_lookup_min":1,"prompt_lookup_max":8}'
  2. Send a curl request (an equivalent Python SDK call is shown after this list).
    curl --request POST \
      --url http://0.0.0.0:18080/v1/chat/completions \
      --header 'content-type: application/json' \
      --data '{
      "model": "${base_model}",
      "messages": [
        {
          "role": "user",
          "content": "Who is the current president of the United States?"
        }
      ],
      "max_tokens": 128,
      "top_k": -1,
      "top_p": 0.1,
      "temperature": 0,
      "stream": false,
      "repetition_penalty": 1.0
    }
    '
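As an alternative to the curl call in step 2, the same request can be sent with the OpenAI Python SDK. This is a sketch only: the base_url and port must match the service started in step 1, the model field must equal the value passed to --model, and the vLLM-specific sampling fields (top_k, repetition_penalty) are passed through extra_body.

    from openai import OpenAI

    # Point the client at the local OpenAI-compatible endpoint started in step 1.
    client = OpenAI(base_url="http://0.0.0.0:18080/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="/path/to/base_model",   # must match the --model value (${base_model})
        messages=[{
            "role": "user",
            "content": "Who is the current president of the United States?",
        }],
        max_tokens=128,
        temperature=0,
        top_p=0.1,
        stream=False,
        extra_body={"top_k": -1, "repetition_penalty": 1.0},   # vLLM-specific extensions
    )
    print(resp.choices[0].message.content)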

Inference Execution Reference

  1. Configure the service parameters. To use this feature in Ascend-vLLM, see Table 1. For details about other parameters, see Starting an LLM-powered Inference Service.
  2. Start the service. For details, see Starting an LLM-powered Inference Service.
  3. Evaluate the accuracy and performance. For details, see Inference Service Accuracy Evaluation and Inference Service Performance Evaluation.