Updated on 2025-08-26 GMT+08:00

Starting Recognition

Function

After the WSS handshake request receives a successful response, the communication protocol between the client and the server is upgraded to WebSocket. Over this WebSocket connection, the client sends a recognition start request that configures the recognition parameters.
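As a minimal sketch of this flow, the helper below assembles a START frame from the fields in Table 1 and Table 2. The endpoint URL and the `websocket-client` package in the commented lines are assumptions for illustration, not part of this document:

```python
import json

def build_start_request(audio_format, prop, **options):
    """Assemble a START frame (field names per Table 1 and Table 2)."""
    config = {"audio_format": audio_format, "property": prop}
    config.update(options)  # optional fields: add_punc, vad_tail, interim_results, ...
    return json.dumps({"command": "START", "config": config})

# Sending the frame over an established WebSocket connection (sketch only;
# assumes the third-party `websocket-client` package and a placeholder URL):
# ws = websocket.create_connection("wss://<endpoint>/...")
# ws.send(build_start_request("pcm16k16bit", "chinese_16k_general", add_punc="yes"))
```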

Request Parameters

Table 1 Parameter descriptions

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| command | Yes | String | Recognition start request sent by the client. Set it to START. |
| config | Yes | Object | Configuration information. For the structure, see Table 2. |

Table 2 config data structure

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| audio_format | Yes | String | Audio format. For supported values, see Table 4. |
| property | Yes | String | Model feature string in use. Generally, the value is in the language_samplingRate_domain format, for example, chinese_8k_common. For supported values, see Table 3. |
| add_punc | No | String | Whether to add punctuation marks to the recognition result. The value can be yes or no. The default value is no. |
| vad_head | No | Integer | Maximum initial silence duration. In single-sentence mode, if the silence at the beginning of the audio lasts longer than or equal to this value, an EXCEEDED_SILENCE event is returned and recognition ends. In continuous mode, the audio is segmented and recognition continues with the next sentence. This parameter does not take effect in streaming mode. A value of 0 is equivalent to 60000. Range: an integer within [0, 60000], in milliseconds. The default value is 10000 ms (10 seconds). |
| vad_tail | No | Integer | Silence duration at the end of the audio. In normal cases, it should not be set to a small value. If the trailing silence is greater than or equal to this value, single-sentence mode returns a VOICE_END event (recognition result is not empty) or an EXCEEDED_SILENCE event (recognition result is empty) and recognition ends; continuous mode segments the sentence and continues with the next one. This parameter does not take effect in streaming mode. Range: an integer within [0, 3000], in milliseconds. The default value is 500 ms. Note: if vad_tail is set to a small value (< 200 ms), sentences are segmented too frequently, degrading the recognition result. |
| max_seconds | No | Integer | Maximum duration of a sentence. When the detected audio reaches or exceeds this duration, single-sentence mode returns a VOICE_END event (recognition result is not empty) or an EXCEEDED_SILENCE event (recognition result is empty) and recognition ends; continuous mode segments the sentence and continues with the next one. This parameter does not take effect in streaming mode. Range: an integer within [1, 60], in seconds. The default value is 30 seconds. |
| interim_results | No | String | Whether to output intermediate results. The value can be yes or no. The default value is no, meaning intermediate results are not output. |
| vocabulary_id | No | String | ID of a hot word table. Leave this field blank if no hot word table is used. For details about how to create a hot word table, see Creating a Hot Word Table. |
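As a convenience, the documented ranges for the optional numeric fields above can be checked client-side before sending the request. This is only a sketch; the service performs its own validation and this table's names are the ground truth:

```python
# Documented ranges for the optional numeric fields in Table 2.
RANGES = {
    "vad_head": (0, 60000),   # ms; a value of 0 behaves like 60000
    "vad_tail": (0, 3000),    # ms; values below 200 ms over-segment
    "max_seconds": (1, 60),   # seconds
}

def check_config(config):
    """Return the config fields whose values fall outside the documented range."""
    bad = []
    for field, (lo, hi) in RANGES.items():
        if field in config and not (lo <= config[field] <= hi):
            bad.append(field)
    return bad
```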

Table 3 Value range of property

| Value | Description |
|---|---|
| chinese_8k_general | Chinese Mandarin speech recognition at a sampling rate of 8 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. |
| chinese_16k_general | Chinese Mandarin speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. |
| english_16k_general | English speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. Digit normalization (digit_norm parameter) is not supported. |
| arabic_16k_general | Arabic speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm. Covers Standard Arabic and the Egyptian, Saudi, and UAE dialects. Punctuation prediction (add_punc parameter), digit normalization (digit_norm parameter), and hot words (vocabulary_id parameter) are not supported. |

Table 4 Value range of audio_format

| Value | Description |
|---|---|
| pcm16k16bit | 16 kHz, 16-bit mono audio recording data |
| pcm8k16bit | 8 kHz, 16-bit mono audio recording data |
| ulaw16k8bit | 16 kHz, 8-bit ulaw mono audio recording data |
| ulaw8k8bit | 8 kHz, 8-bit ulaw mono audio recording data |
| alaw16k8bit | 16 kHz, 8-bit alaw mono audio recording data |
| alaw8k8bit | 8 kHz, 8-bit alaw mono audio recording data |

Currently, only raw PCM-encoded audio data is supported. Other formats (audio with a WAV header, or AMR-encoded audio) are not supported.
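Since the property value embeds a sampling rate (the language_samplingRate_domain format) and the audio_format names in Table 4 do as well, the two should agree. The heuristic check below is an illustrative sketch, not part of the API:

```python
def sample_rates_match(audio_format, prop):
    """Check that the 8k/16k token agrees between audio_format (Table 4)
    and property (Table 3), e.g. chinese_8k_general pairs with pcm8k16bit."""
    fmt_rate = "16k" if "16k" in audio_format else "8k"
    prop_rate = prop.split("_")[1]  # language_samplingRate_domain
    return fmt_rate == prop_rate
```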

Example

{
  "command": "START",
  "config":
  {
    "audio_format": "ulaw8k8bit",
    "property": "chinese_8k_general",
    "add_punc": "yes",
    "vad_tail": 400,
    "interim_results": "yes"
  }
}

Status Codes

See Status Codes.

Error Codes

See Error Codes.