Updated on 2025-11-04 GMT+08:00

Reasoning Outputs

Scenarios

Reasoning outputs are supported for the DeepSeek-R1 and Qwen3 model series. These models produce detailed inference steps along with the final result. When reasoning outputs are enabled, responses include a reasoning_content field that exposes the thought process and logic behind each conclusion.
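As an illustration, a non-streaming response with reasoning outputs enabled might look like the following. Only the reasoning_content and content fields are described by this document; the other fields are standard OpenAI chat-completion fields, and all values are placeholders:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "The user greeted me, so a short greeting is appropriate ...",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ]
}
```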

Supported Models

Series        Parser
------        ------
DeepSeek-R1   deepseek_r1
QwQ-32B       deepseek_r1
Qwen3         qwen3

Constraints

Reasoning outputs only apply to the OpenAI-compatible /v1/chat/completions API.

Enabling Reasoning Outputs

Add the following options when starting the inference service.

--enable-reasoning --reasoning-parser xxx  

Note: xxx indicates the name of the reasoning parser that matches the model; see the Supported Models table above.
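For example, a DeepSeek-R1 deployment could be started as follows. This is a sketch: the vllm serve command, model path, and port are assumptions about your deployment; only the two reasoning flags come from this document.

```shell
# Illustrative launch command; replace the model path and port with your own.
vllm serve /models/DeepSeek-R1 \
  --port 8000 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```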

Disabling Model Chain-of-Thought Output

Currently, only the Qwen3 series models support disabling the chain-of-thought output. You can do this by adding the chat template parameter "enable_thinking": false under chat_template_kwargs when making an inference request. An example request body:

{
  "model": "Qwen3-8B",
  "chat_template_kwargs": {
    "enable_thinking": false
  },
  "messages": [
    {
      "role": "user",
      "content": "Hello"
    }
  ],
  "temperature": 0,
  "stream": false
}
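Assuming the service listens on localhost:8000 (an assumption; adjust the host and port to your deployment), the request body above can be sent with curl:

```shell
# Hypothetical endpoint; adjust host/port to match your inference service.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-8B",
        "chat_template_kwargs": {"enable_thinking": false},
        "messages": [{"role": "user", "content": "Hello"}],
        "temperature": 0,
        "stream": false
      }'
```

With "enable_thinking": false, the response message should contain only the content field, without a reasoning_content field.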

Removing the Max Token Limit for Reasoning Content

Ascend-vLLM can exclude the reasoning content from the max token limit. Enable or disable this behavior by setting an environment variable before starting the inference service. For example:

export ENABLE_MAX_TOKENS_EXCLUDE_REASONING=1
  • Environment variable unset or set to 0: the max_tokens parameter controls and truncates the length of the reasoning_content field. This behavior is consistent with the community standard.
  • Environment variable set to 1: the max_tokens parameter does not control or truncate the reasoning_content field; it only controls the length of the content field.