Updated on 2025-08-26 GMT+08:00

Starting Recognition

Function

After the WSS handshake request receives a successful response, the communication protocol between the client and the server is upgraded to WebSocket. Over this WebSocket connection, the client sends a recognition start request that configures the recognition parameters.
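As a minimal sketch of this flow, the helper below assembles a START frame from the fields in Table 1 and Table 2. The endpoint URL and the `websocket-client` package in the commented lines are assumptions for illustration, not part of this document:

```python
import json

def build_start_request(audio_format, prop, **options):
    """Assemble a START frame (field names per Table 1 and Table 2)."""
    config = {"audio_format": audio_format, "property": prop}
    config.update(options)  # optional fields: add_punc, vad_tail, interim_results, ...
    return json.dumps({"command": "START", "config": config})

# Sending the frame over an established WebSocket connection (sketch only;
# assumes the third-party `websocket-client` package and a placeholder URL):
# ws = websocket.create_connection("wss://<endpoint>/...")
# ws.send(build_start_request("pcm16k16bit", "chinese_16k_general", add_punc="yes"))
```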

Request Parameters

Table 1 Parameter descriptions

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| command | Yes | String | Recognition start request sent by the client. Set it to START. |
| config | Yes | Object | Configuration information. For the structure, see Table 2. |

Table 2 config data structure

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| audio_format | Yes | String | Audio format. For supported values, see Table 4. |
| property | Yes | String | Model feature string in use. Generally, the value is in the language_samplingRate_domain format, for example, chinese_8k_common. For supported values, see Table 3. |
| add_punc | No | String | Whether to add punctuation marks to the recognition result. The value can be yes or no. The default value is no. |
| vad_head | No | Integer | Maximum initial silence duration. In single-sentence mode, if the silence at the beginning of the audio lasts longer than or equal to this value, an EXCEEDED_SILENCE event is returned and recognition ends. In continuous mode, the audio is segmented and recognition continues with the next sentence. This parameter does not take effect in streaming mode. A value of 0 is equivalent to 60000. Range: an integer within [0, 60000], in milliseconds. The default value is 10000 ms (10 seconds). |
| vad_tail | No | Integer | Silence duration at the end of the audio. In normal cases, it should not be set to a small value. If the trailing silence is greater than or equal to this value, single-sentence mode returns a VOICE_END event (recognition result is not empty) or an EXCEEDED_SILENCE event (recognition result is empty) and recognition ends; continuous mode segments the sentence and continues with the next one. This parameter does not take effect in streaming mode. Range: an integer within [0, 3000], in milliseconds. The default value is 500 ms. Note: if vad_tail is set to a small value (< 200 ms), sentences are segmented too frequently, degrading the recognition result. |
| max_seconds | No | Integer | Maximum duration of a sentence. When the detected audio reaches or exceeds this duration, single-sentence mode returns a VOICE_END event (recognition result is not empty) or an EXCEEDED_SILENCE event (recognition result is empty) and recognition ends; continuous mode segments the sentence and continues with the next one. This parameter does not take effect in streaming mode. Range: an integer within [1, 60], in seconds. The default value is 30 seconds. |
| interim_results | No | String | Whether to output intermediate results. The value can be yes or no. The default value is no, meaning intermediate results are not output. |
| vocabulary_id | No | String | ID of a hot word table. Leave this field blank if no hot word table is used. For details about how to create a hot word table, see Creating a Hot Word Table. |
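As a convenience, the documented ranges for the optional numeric fields above can be checked client-side before sending the request. This is only a sketch; the service performs its own validation and this table's names are the ground truth:

```python
# Documented ranges for the optional numeric fields in Table 2.
RANGES = {
    "vad_head": (0, 60000),   # ms; a value of 0 behaves like 60000
    "vad_tail": (0, 3000),    # ms; values below 200 ms over-segment
    "max_seconds": (1, 60),   # seconds
}

def check_config(config):
    """Return the config fields whose values fall outside the documented range."""
    bad = []
    for field, (lo, hi) in RANGES.items():
        if field in config and not (lo <= config[field] <= hi):
            bad.append(field)
    return bad
```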

Table 3 Value range of property

| Value | Description |
|---|---|
| chinese_8k_general | Chinese Mandarin speech recognition at a sampling rate of 8 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. |
| chinese_16k_general | Chinese Mandarin speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. |
| english_16k_general | English speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm for higher accuracy. Digit normalization (digit_norm parameter) is not supported. |
| arabic_16k_general | Arabic speech recognition at a sampling rate of 16 kHz, using the next-generation end-to-end recognition algorithm. Covers Standard Arabic and the Egyptian, Saudi, and UAE dialects. Punctuation prediction (add_punc parameter), digit normalization (digit_norm parameter), and hot words (vocabulary_id parameter) are not supported. |

Table 4 Value range of audio_format

| Value | Description |
|---|---|
| pcm16k16bit | 16 kHz, 16-bit mono audio recording data |
| pcm8k16bit | 8 kHz, 16-bit mono audio recording data |
| ulaw16k8bit | 16 kHz, 8-bit ulaw mono audio recording data |
| ulaw8k8bit | 8 kHz, 8-bit ulaw mono audio recording data |
| alaw16k8bit | 16 kHz, 8-bit alaw mono audio recording data |
| alaw8k8bit | 8 kHz, 8-bit alaw mono audio recording data |

Currently, only raw PCM-encoded audio data is supported. Other formats (audio with a WAV header, or AMR-encoded audio) are not supported.
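Since the property value embeds a sampling rate (the language_samplingRate_domain format) and the audio_format names in Table 4 do as well, the two should agree. The heuristic check below is an illustrative sketch, not part of the API:

```python
def sample_rates_match(audio_format, prop):
    """Check that the 8k/16k token agrees between audio_format (Table 4)
    and property (Table 3), e.g. chinese_8k_general pairs with pcm8k16bit."""
    fmt_rate = "16k" if "16k" in audio_format else "8k"
    prop_rate = prop.split("_")[1]  # language_samplingRate_domain
    return fmt_rate == prop_rate
```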

Example

{
  "command": "START",
  "config":
  {
    "audio_format": "ulaw8k8bit",
    "property": "chinese_8k_general",
    "add_punc": "yes",
    "vad_tail": 400,
    "interim_results": "yes"
  }
}

Status Codes

See Status Codes.

Error Codes

See Error Codes.