Starting Recognition
Function
After the wss handshake request receives a successful response, the communication protocol between the client and the server is upgraded to the WebSocket protocol. Over this WebSocket connection, the client sends a request to start recognition and configure the related parameters.
Request Parameters
**Table 1** Request parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| command | Yes | String | The client sends a recognition start request. Set it to START. |
| config | Yes | Object | Configuration information. For details about the structure, see Table 2. |
**Table 2** config parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| audio_format | Yes | String | Audio format. For supported values, see Table 4. |
| property | Yes | String | Model feature string in use. Generally, the value is in the *language*_*sampling rate*_*domain* format, for example, chinese_8k_general. For details, see Table 3. |
| add_punc | No | String | Whether to add punctuation marks to the recognition result. The value can be yes or no. The default value is no. |
| vad_head | No | Integer | Maximum silence duration at the beginning of the audio. If the initial silence lasts longer than or equal to this value, an EXCEEDED_SILENCE event is returned in single-sentence mode and the recognition ends. In continuous mode, the audio is segmented and recognition of the next sentence continues. This parameter does not take effect in streaming mode. The value 0 is equivalent to 60000. Range: an integer within [0, 60000], in milliseconds. The default value is 10000 ms (10 seconds). |
| vad_tail | No | Integer | Silence duration at the end of the audio. In normal cases, it should not be set to a small value. If the trailing silence lasts longer than or equal to this value, a VOICE_END (the recognition result is not empty) or EXCEEDED_SILENCE (the recognition result is empty) event is returned in single-sentence mode and the recognition ends. In continuous mode, the sentence is segmented and recognition of the next sentence continues. This parameter does not take effect in streaming mode. Range: an integer within [0, 3000], in milliseconds. The default value is 500 ms. Note: If vad_tail is set to a small value (< 200 ms), sentences are segmented too frequently, affecting the recognition result. |
| max_seconds | No | Integer | Maximum duration of a sentence. When the detected audio duration reaches or exceeds this value, a VOICE_END (the recognition result is not empty) or EXCEEDED_SILENCE (the recognition result is empty) event is returned in single-sentence mode and the recognition ends. In continuous mode, the sentence is segmented and recognition of the next sentence continues. This parameter does not take effect in streaming mode. Range: an integer within [1, 60], in seconds. The default value is 30 seconds. |
| interim_results | No | String | Whether to output intermediate results. The value can be yes or no. The default value is no, which indicates that intermediate results are not output. |
| vocabulary_id | No | String | ID of a hot word table. Leave this field blank if no hot word table is used. For details about how to create a hot word table, see Creating a Hot Word Table. |
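The configuration fields above can be assembled into a START request body. The sketch below is a hypothetical helper (not part of any official SDK) that builds the message and enforces the documented ranges before the request is sent:

```python
import json


def build_start_request(audio_format, prop, add_punc="no",
                        vad_head=10000, vad_tail=500, max_seconds=30,
                        interim_results="no", vocabulary_id=None):
    """Build the START message body, checking the documented parameter ranges."""
    if not 0 <= vad_head <= 60000:
        raise ValueError("vad_head must be within [0, 60000] ms")
    if not 0 <= vad_tail <= 3000:
        raise ValueError("vad_tail must be within [0, 3000] ms")
    if not 1 <= max_seconds <= 60:
        raise ValueError("max_seconds must be within [1, 60] s")
    config = {
        "audio_format": audio_format,
        "property": prop,
        "add_punc": add_punc,
        "vad_head": vad_head,
        "vad_tail": vad_tail,
        "max_seconds": max_seconds,
        "interim_results": interim_results,
    }
    if vocabulary_id:  # optional; omitted when no hot word table is used
        config["vocabulary_id"] = vocabulary_id
    return json.dumps({"command": "START", "config": config})


# 8 kHz u-law audio with punctuation and intermediate results enabled.
msg = build_start_request("ulaw8k8bit", "chinese_8k_general",
                          add_punc="yes", vad_tail=400, interim_results="yes")
```

Centralizing the range checks client-side surfaces configuration mistakes before the WebSocket round trip.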
**Table 3** property values

| Value | Description |
|---|---|
| chinese_8k_general | Chinese Mandarin speech recognition at a sampling rate of 8 kHz, using a next-generation end-to-end recognition algorithm for higher recognition accuracy. |
| chinese_16k_general | Chinese Mandarin speech recognition at a sampling rate of 16 kHz, using a next-generation end-to-end recognition algorithm for higher recognition accuracy. |
| english_16k_general | English speech recognition at a sampling rate of 16 kHz, using a next-generation end-to-end recognition algorithm for higher recognition accuracy. Digit normalization (digit_norm parameter) is not supported. |
| arabic_16k_general | Arabic speech recognition at a sampling rate of 16 kHz, using a next-generation end-to-end recognition algorithm that accommodates Standard Arabic and the Egyptian, Saudi, and UAE dialects. Punctuation prediction (add_punc parameter), digit normalization (digit_norm parameter), and hot words (vocabulary_id parameter) are not supported. |
**Table 4** audio_format values

| Value | Description |
|---|---|
| pcm16k16bit | 16 kHz, 16-bit mono PCM audio data |
| pcm8k16bit | 8 kHz, 16-bit mono PCM audio data |
| ulaw16k8bit | 16 kHz, 8-bit u-law mono audio data |
| ulaw8k8bit | 8 kHz, 8-bit u-law mono audio data |
| alaw16k8bit | 16 kHz, 8-bit A-law mono audio data |
| alaw8k8bit | 8 kHz, 8-bit A-law mono audio data |

Currently, only raw PCM-encoded audio data is supported. Audio with a WAV header or in AMR format is not supported.
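Because only headerless PCM is accepted, a WAV recording must have its header stripped before it is sent. A minimal sketch using Python's standard `wave` module (the function name and file path are illustrative, not part of the service API):

```python
import wave


def wav_to_raw_pcm(source):
    """Read a PCM WAV file (path or file-like object) and return its raw
    sample bytes with the WAV header removed."""
    with wave.open(source, "rb") as wav:
        # The service's pcm*16bit formats expect 16-bit mono audio.
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio")
        return wav.readframes(wav.getnframes())


# raw = wav_to_raw_pcm("recording_16k.wav")  # bytes suitable for pcm16k16bit
```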
Example
```json
{
  "command": "START",
  "config": {
    "audio_format": "ulaw8k8bit",
    "property": "chinese_8k_general",
    "add_punc": "yes",
    "vad_tail": 400,
    "interim_results": "yes"
  }
}
```
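Once the wss handshake has completed, this request is sent as a text frame over the upgraded connection. A minimal sketch is shown below; the endpoint URL is a placeholder, and the third-party `websocket-client` package mentioned in the comment is one possible client, not a requirement of the service:

```python
import json

# Placeholder only; substitute the actual service endpoint.
WSS_URL = "wss://example.com/v1/asr"

START_MESSAGE = json.dumps({
    "command": "START",
    "config": {
        "audio_format": "ulaw8k8bit",
        "property": "chinese_8k_general",
        "add_punc": "yes",
        "vad_tail": 400,
        "interim_results": "yes",
    }
})


def send_start(ws):
    """Send the START request over an already-upgraded WebSocket connection.

    `ws` is any object exposing a send(str) method, e.g. a connection
    created with the websocket-client package (an assumption, not mandated
    by this API).
    """
    ws.send(START_MESSAGE)


# Usage with websocket-client (assumed):
#   import websocket
#   ws = websocket.create_connection(WSS_URL)
#   send_start(ws)
```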
Status Codes
See Status Codes.
Error Codes
See Error Codes.