
What Is SIS

Speech Interaction Service (SIS) provides human-machine speech interaction through open application programming interfaces (APIs). You can obtain speech interaction results through real-time access and API calls. SIS consists of the following three sub-services:

  • Automatic Speech Recognition (ASR): converts audio recordings into text.
  • Text To Speech (TTS): converts text into lifelike voices.
  • Real-Time ASR (RASR): converts continuous audio streams into text in real time, enabling faster speech recognition.

ASR

Currently, ASR provides two functions: Sentence Recognition and Long Speech Recognition. Sentence Recognition recognizes short audio recordings faster. Long Speech Recognition delivers better results for longer audio recordings.

  • Sentence Recognition converts audio recordings whose duration is within 1 minute and whose size is less than 4 MB into text.
  • Long Speech Recognition converts audio recordings whose duration is less than 4 hours into text. After a complete audio file is uploaded, the system automatically converts it into text.
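
The limits above can be turned into a simple pre-check before a recording is submitted. The following is a minimal sketch, not part of the SIS SDK: the function name `choose_asr_function`, the return values, and the reading of 4 MB as 4 × 1024 × 1024 bytes are assumptions made for illustration.

```python
# Limits quoted from the ASR description above.
SENTENCE_MAX_SECONDS = 60              # "within 1 minute"
SENTENCE_MAX_BYTES = 4 * 1024 * 1024   # "less than 4 MB" (assumes 1 MB = 1024 * 1024 bytes)
LONG_MAX_SECONDS = 4 * 60 * 60         # "less than 4 hours"


def choose_asr_function(duration_seconds: float, size_bytes: int) -> str:
    """Return "sentence" or "long" for a recording (hypothetical helper).

    Raises ValueError if the recording fits neither documented limit.
    """
    if duration_seconds <= 0 or size_bytes <= 0:
        raise ValueError("duration and size must be positive")
    # Sentence Recognition: at most 1 minute and smaller than 4 MB.
    if duration_seconds <= SENTENCE_MAX_SECONDS and size_bytes < SENTENCE_MAX_BYTES:
        return "sentence"
    # Long Speech Recognition: shorter than 4 hours.
    if duration_seconds < LONG_MAX_SECONDS:
        return "long"
    raise ValueError("recording is 4 hours or longer; split it before recognition")
```

For example, a 30-second, 1 MB recording maps to Sentence Recognition, while a 30-second recording of 5 MB exceeds the size limit and falls back to Long Speech Recognition.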

ASR Advantages

  • High Accuracy

    Employs deep learning technologies to achieve a speech recognition accuracy of over 95%.

  • Strong Language Support

    Supports recognition of Mandarin Chinese speech to meet application requirements in various scenarios.

  • Solid Reliability

    Delivers proven stability after years of use in complex enterprise customer scenarios.

  • High Efficiency

    Provides standard RESTful APIs and service SDKs to facilitate use and integration while reducing labor and business costs.

TTS

TTS converts text into lifelike voices. It provides speech services with customizable timbres, volumes, and speeds for enterprises and individuals.

TTS Advantages

  • High Accuracy

    Employs deep learning technologies to rapidly synthesize natural, fluent human speech.

  • Customization

    Allows you to customize the timbre, tone, and speed of spoken text based on your needs.

  • Solid Reliability

    Delivers proven stability after years of use in complex enterprise customer scenarios.

  • High Efficiency

    Provides standard RESTful APIs and various SDKs to facilitate use and integration while reducing labor and business costs.

RASR

You can access and call APIs to obtain the speech recognition result in real time.

RASR Advantages

  • High Recognition Accuracy

    Adopts the latest generation of speech recognition and Deep Neural Network (DNN) technologies to greatly improve the anti-noise performance and recognition accuracy.

  • High Speed

    Integrates the language models, dictionaries, and acoustic models into one large neural network. Together with engineering optimizations, this greatly increases the decoding speed and achieves faster recognition.

  • Multiple Recognition Modes

    Supports multiple real-time speech recognition modes, including streaming, continuous, and single-sentence, to suit different application scenarios.

  • Customization Service

    Allows you to customize the language-layer model in a specific vertical domain to better recognize proprietary words and industry terms, adding a significant boost to accuracy.

RASR Functions

  • Text Timestamps

    Generates timestamps for the audio conversion result, so that you can quickly locate the corresponding spot in the original audio clip, confirm the text, and use it if needed.

  • Intelligent Text Segmentation

    Intelligently segments sentences and adds punctuation marks by combining semantic features extracted from the context with voice features, improving the readability of the output text.

  • Hybrid Recognition

    Supports recognition of English letters/words and digits included in Chinese sentences.

  • Support for Multiple Dialects and Minority Languages

    Recognizes various dialects, such as Cantonese, Sichuanese, and Hokkien, as well as many minority languages, such as Mongolian, Tibetan, and Uyghur.

  • Instant Result Output

    Continuously recognizes voice streams, outputs results in real time, and automatically corrects the content based on the context language model.

  • Automatic VAD

    Performs voice activity detection (VAD) on the input voice streams to improve recognition efficiency and accuracy.

  • Flexible Access Modes

    Supports access over WebSocket and MRCP interfaces.
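
To illustrate how the Text Timestamps function listed above might be used on the client side, the sketch below searches recognized text for a keyword and reports where it occurs in the audio. The `Segment` structure, its field names, and the millisecond unit are assumptions made for this example; the actual response format is defined in the SIS API reference.

```python
from typing import List, NamedTuple, Sequence


class Segment(NamedTuple):
    """One recognized piece of text with its position in the audio (hypothetical shape)."""
    start_ms: int
    end_ms: int
    text: str


def find_spots(segments: Sequence[Segment], keyword: str) -> List[Segment]:
    """Return the segments whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [seg for seg in segments if needle in seg.text.lower()]


def format_ms(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm for display."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


if __name__ == "__main__":
    # Sample data invented for the example.
    result = [
        Segment(0, 1500, "Welcome to the meeting."),
        Segment(61_500, 64_000, "The budget review starts now."),
    ]
    for seg in find_spots(result, "budget"):
        print(f"{format_ms(seg.start_ms)} - {format_ms(seg.end_ms)}: {seg.text}")
```

With timestamps attached to each piece of text, a reviewer can jump directly to the matching offset in the original recording instead of replaying the whole clip.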