
Usage of Speculative Inference

What Is Speculative Inference?

Traditional LLM inference relies mainly on auto-regressive decoding, where each decoding step produces only one output token and the previously generated tokens must be concatenated and fed back into the LLM before the next step can run, so the expensive LLM is invoked once for every single token. To address this inefficiency, speculative inference has been proposed. The core idea is to let a small model, whose computational cost is far lower than that of the LLM, perform the speculation: the small model speculatively infers multiple steps at a time, and the LLM then validates all of these draft results in a single step.

In speculative mode, the small model first infers tokens 1, 2, and 3 sequentially. These 3 draft tokens are then input into the LLM for a single validation pass, which produces tokens 1', 2', 3', and 4'. The draft tokens 1, 2, and 3 are compared with 1', 2', and 3' respectively: drafts are accepted from left to right as long as they match, and the first mismatching position falls back to the LLM's own token. One cycle therefore yields 1 to 4 tokens from three speculative inferences of the small model (which are much faster than the large model) plus one inference of the large model, significantly improving inference performance.
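The accept/verify logic can be summarized in a few lines. Below is a minimal, framework-agnostic sketch of one draft-and-verify cycle; draft_next_token and target_logits are hypothetical stand-ins for the small-model and large-model calls, and only greedy (exact-match) acceptance is shown, whereas production implementations typically use a probabilistic acceptance rule when sampling.

    # Minimal sketch of one draft-and-verify cycle (hypothetical draft_next_token /
    # target_logits helpers stand in for real model calls; greedy acceptance only).
    def speculative_step(prompt_ids, draft_next_token, target_logits, k=3):
        # 1) The small model drafts k tokens auto-regressively (cheap).
        draft = []
        ctx = list(prompt_ids)
        for _ in range(k):
            t = draft_next_token(ctx)
            draft.append(t)
            ctx.append(t)

        # 2) The large model scores prompt + draft in a single forward pass,
        #    producing its own prediction at every draft position (tokens 1'..(k+1)').
        logits = target_logits(prompt_ids + draft)          # one row of logits per position
        target_pred = [int(row.argmax()) for row in logits[len(prompt_ids) - 1:]]

        # 3) Accept drafted tokens left to right while they match the large model,
        #    then append the large model's own token at the first mismatch (or the
        #    bonus token (k+1)' if everything matched). Yields 1..k+1 tokens per cycle.
        accepted = []
        for i, t in enumerate(draft):
            if t == target_pred[i]:
                accepted.append(t)
            else:
                accepted.append(target_pred[i])
                break
        else:
            accepted.append(target_pred[k])
        return accepted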

In this way, speculative inference brings the following advantage:

Shorter average decode time: Taking Qwen2-72B as the large model and Qwen2-0.5B as the small model, one inference step of the small model takes less than 1/5 of the time of a large-model step. Including the validation step, a complete speculative inference cycle takes only about 1.5 times as long as a single large-model step (with the speculative step count set to 3), and on average such a cycle produces 3 valid tokens. Generating 3 times the tokens at 1.5 times the time cost doubles decode throughput, i.e., roughly a 100% performance improvement, as the sketch below illustrates.
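The arithmetic behind that claim can be checked with a quick estimate. The sketch below is illustrative only; the draft cost ratio, speculative step count, and average number of accepted tokens are assumptions taken from the example above, not measured values.

    # Back-of-the-envelope throughput estimate for speculative decoding.
    def speculative_speedup(draft_cost_ratio=0.2, draft_steps=3, accepted_tokens=3.0):
        # One cycle = draft_steps cheap small-model steps + 1 large-model validation step,
        # measured relative to the cost of a single large-model decode step.
        cycle_cost = draft_steps * draft_cost_ratio + 1.0   # 3 * 0.2 + 1 = 1.6, i.e. ~1.5x
        # Plain auto-regressive decoding produces 1 token per unit of large-model cost.
        return accepted_tokens / cycle_cost

    print(speculative_speedup())   # ~1.9x, in line with the ~100% improvement cited above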

Speculative Inference Parameter Settings

When starting the offline or online inference service, set parameters by referring to Table 1 to use speculative inference.

Table 1 Speculative inference parameters

Configuration Item   | Parameter               | Type | Description
---------------------|-------------------------|------|------------------------------------------------------------
--speculative-config | num_speculative_tokens  | int  | The number of tokens to predict each time; must be greater than or equal to 1. It is recommended to start with 1 and then adjust based on the acceptance rate after confirming the benefit.
--speculative-config | method                  | str  | Speculative method; currently only ngram is supported.
--speculative-config | prompt_lookup_min       | int  | Minimum match length; effective only when method is set to ngram.
--speculative-config | prompt_lookup_max       | int  | Maximum match length; effective only when method is set to ngram.
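For offline inference, the same parameters can be passed programmatically. The snippet below is a minimal sketch assuming a vLLM/Ascend-vLLM build whose LLM class accepts speculative_config as a dict mirroring the --speculative-config command-line flag; the model path is a placeholder and should be adjusted to your environment.

    # Offline counterpart of the CLI flags in Table 1 (sketch; verify against your vLLM version).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="/path/to/base_model",          # placeholder path
        tensor_parallel_size=4,
        max_model_len=8192,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 1,
            "prompt_lookup_min": 1,
            "prompt_lookup_max": 8,
        },
    )

    outputs = llm.generate(
        ["Who is the current president of the United States?"],
        SamplingParams(temperature=0, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)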

Example of E2E Speculative Inference

The following example uses the Qwen3-32B model as the large model and accesses it through the OpenAI-compatible API service.

  1. Start the inference service:
    base_model=/path/to/base_model
    export VLLM_PLUGINS=ascend_vllm
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    python -m vllm.entrypoints.openai.api_server --model=${base_model} \
    --max-num-seqs=256 \
    --max-model-len=8192 \
    --max-num-batched-tokens=8192 \
    --dtype=bfloat16 \
    --tensor-parallel-size=4 \
    --host=0.0.0.0 \
    --port=18080 \
    --gpu-memory-utilization=0.8 \
    --trust-remote-code \
    --additional-config='{"ascend_turbo_graph_config": {"enabled": true}}' \
    --speculative-config '{"num_speculative_tokens":1,"method":"ngram","prompt_lookup_min":1,"prompt_lookup_max":8}'
  2. Send a curl request (an equivalent Python SDK call is shown after this list).
    curl --request POST \
      --url http://0.0.0.0:18080/v1/chat/completions \
      --header 'content-type: application/json' \
      --data '{
      "model": "${base_model}",
      "messages": [
        {
          "role": "user",
          "content": "Who is the current president of the United States?"
        }
      ],
      "max_tokens": 128,
      "top_k": -1,
      "top_p": 0.1,
      "temperature": 0,
      "stream": false,
      "repetition_penalty": 1.0
    }
    '
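As an alternative to the curl call in step 2, the same request can be sent with the OpenAI Python SDK. This is a sketch only: the base_url and port must match the service started in step 1, the model field must equal the value passed to --model, and the vLLM-specific sampling fields (top_k, repetition_penalty) are passed through extra_body.

    from openai import OpenAI

    # Point the client at the local OpenAI-compatible endpoint started in step 1.
    client = OpenAI(base_url="http://0.0.0.0:18080/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="/path/to/base_model",   # must match the --model value (${base_model})
        messages=[{
            "role": "user",
            "content": "Who is the current president of the United States?",
        }],
        max_tokens=128,
        temperature=0,
        top_p=0.1,
        stream=False,
        extra_body={"top_k": -1, "repetition_penalty": 1.0},   # vLLM-specific extensions
    )
    print(resp.choices[0].message.content)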

Inference Execution Reference

  1. Configure the service parameters. To use this feature in Ascend-vLLM, see Table 1. For details about other parameters, see Starting an LLM-powered Inference Service.
  2. Start the service. For details, see Starting an LLM-powered Inference Service.
  3. Evaluate the accuracy and performance. For details, see Inference Service Accuracy Evaluation and Inference Service Performance Evaluation.