What Is SIS

Speech Interaction Service (SIS) provides human-machine speech interaction through open application programming interfaces (APIs). You can obtain speech interaction results by accessing the service in real time and calling the APIs. SIS consists of the following sub-services:

  • ASR Customization (ASRC): leverages deep learning technology to optimize speech recognition for specific fields and allows you to define language models.

    ASRC provides the Sentence Transcription and Long Audio Transcription functions.

  • Real-Time ASR (RASR): converts continuous audio streams into text in real time, enabling faster speech recognition.
  • Automatic Speech Recognition (ASR): converts audio recordings into text.
  • Text To Speech (TTS): converts text into lifelike voices.

ASRC

ASRC provides the Sentence Transcription and Long Audio Transcription functions. Sentence Transcription recognizes short audio recordings quickly, while Long Audio Transcription is suited to longer recordings.

  • Sentence Transcription converts audio recordings whose duration is less than 1 minute into text. After binary data is uploaded, the system automatically processes and converts it into text.
  • Long Audio Transcription recognizes long audio recordings and converts them into text. It features good scalability, provides different models for different domains, and supports hot word customization.
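The choice between the two functions depends mainly on the recording length. The following is a minimal sketch of that decision, assuming the duration is already known in seconds. The names `AsrcFunction` and `choose_asrc_function` are hypothetical helpers for illustration and are not part of any SIS SDK; only the 1-minute boundary comes from the description above.

```python
from enum import Enum

# From the description above: Sentence Transcription handles recordings
# shorter than 1 minute.
SENTENCE_LIMIT_SECONDS = 60.0


class AsrcFunction(Enum):
    """Hypothetical labels for the two ASRC functions."""
    SENTENCE_TRANSCRIPTION = "sentence_transcription"
    LONG_AUDIO_TRANSCRIPTION = "long_audio_transcription"


def choose_asrc_function(duration_seconds: float) -> AsrcFunction:
    """Pick an ASRC function based on the recording duration."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    if duration_seconds < SENTENCE_LIMIT_SECONDS:
        return AsrcFunction.SENTENCE_TRANSCRIPTION
    return AsrcFunction.LONG_AUDIO_TRANSCRIPTION


if __name__ == "__main__":
    print(choose_asrc_function(12.5).value)   # sentence_transcription
    print(choose_asrc_function(600.0).value)  # long_audio_transcription
```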

ASRC Advantages

  • High Recognition Rate

    Uses deep learning to optimize the corpus for specific scenarios and fields, enabling an industry-leading recognition rate.

  • Cutting-Edge Technologies

    Combines mature speech recognition algorithms currently in active use in the industry with the latest research to empower enterprises with unique competitive advantages.

  • Hot Word Recognition

    Allows you to upload wordlists critical to your industry so that your professional jargon can be more accurately recognized.

  • Customizable Models

    Increases accuracy by using speech recognition models designed for the specific requirements of the vertical industry you operate in or for other specific scenarios.

RASR

You can access and call APIs to obtain the speech recognition result in real time.
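A real-time client usually sends audio in small pieces rather than as one complete file. The following is a minimal sketch of that chunking step only, assuming raw PCM input. The function name `iter_audio_chunks`, the 16 kHz sample rate, the 16-bit sample width, and the 100 ms chunk length are illustrative assumptions, not documented SIS parameters, and no SIS API is called here.

```python
from typing import Iterator


def iter_audio_chunks(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed sample rate in Hz
    sample_width: int = 2,     # assumed bytes per sample (16-bit PCM)
    chunk_ms: int = 100,       # assumed chunk length in milliseconds
) -> Iterator[bytes]:
    """Yield consecutive chunks of raw mono PCM audio.

    Each chunk holds a whole number of samples; the last chunk may be shorter.
    """
    chunk_samples = sample_rate * chunk_ms // 1000
    chunk_bytes = chunk_samples * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be at least one sample")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    audio = bytes(8000)  # 0.25 s of silence at 16 kHz, 16-bit mono
    print([len(chunk) for chunk in iter_audio_chunks(audio)])  # [3200, 3200, 1600]
```

In a real client, each chunk would be sent over the streaming connection as soon as it is captured, so that recognition can start before the speaker finishes.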

RASR Advantages

  • High Recognition Accuracy

    Adopts the latest generation of speech recognition and deep neural network (DNN) technologies to greatly improve noise robustness and recognition accuracy.

  • High Speed

    Integrates the language models, dictionaries, and acoustic models into a large neural network, with engineering optimizations that greatly increase the decoding speed and therefore the recognition speed.

  • Multiple Recognition Modes

    Supports multiple real-time speech recognition modes, including streaming, continuous, and single-sentence, to suit different application scenarios.

  • Customization Service

    Allows you to customize the language-layer model in a specific vertical domain to better recognize proprietary words and industry terms, significantly improving accuracy.

RASR Functions

  • Text Timestamps

    Generates timestamps for the transcribed text so that you can quickly locate the corresponding position in the original audio clip, verify the text, and use it if needed.

  • Intelligent Text Segmentation

    Extracts semantic features from the context and combines them with voice features to intelligently segment sentences and add punctuation marks, improving the readability of the output text.

  • Hybrid Recognition

    Supports recognition of English letters/words and digits included in Chinese sentences.

  • Instant Result Output

    Continuously recognizes voice streams, outputs results in real time, and automatically corrects the content based on the context language model.

  • Automatic VAD

    Performs voice activity detection (VAD) on the input voice streams to improve recognition efficiency and accuracy.

  • Flexible Access Modes

    Supports access over WebSocket and MRCP interfaces.
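To illustrate the idea behind voice activity detection, the following is a minimal sketch of a simple energy-threshold detector. It is not the algorithm RASR uses, which this document does not describe. The function names, the 20 ms frame length, and the 0.02 threshold are assumptions chosen for illustration, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Return the root-mean-square energy of one non-empty frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,       # assumed frame length
    threshold: float = 0.02,  # assumed RMS threshold for "voice present"
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of regions above the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    segments: List[Tuple[float, float]] = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        active = frame_rms(frame) >= threshold
        t = i * frame_len / sample_rate
        if active and start is None:
            start = t                    # a voiced region begins
        elif not active and start is not None:
            segments.append((start, t))  # the voiced region ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


if __name__ == "__main__":
    # 0.1 s of silence, 0.2 s of signal, 0.1 s of silence at 1 kHz
    signal = [0.0] * 100 + [0.5] * 200 + [0.0] * 100
    print(detect_speech(signal, sample_rate=1000))  # [(0.1, 0.3)]
```

Skipping the silent regions reduces the amount of audio that has to be decoded, which is one way VAD can improve efficiency.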

ASR

ASR converts audio recordings whose duration is within 1 minute and whose size is less than 4 MB into text. After a complete audio file is uploaded, the system automatically converts it into text.
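Before uploading, a client can check a recording against these two limits locally. The following is a minimal sketch for WAV files using only the Python standard library. The helper name `check_asr_limits` is hypothetical, and treating 4 MB as 4 × 1024 × 1024 bytes is an assumption; this document does not state how the service counts megabytes or which audio formats it accepts.

```python
import os
import wave
from typing import List

MAX_DURATION_SECONDS = 60.0       # "within 1 minute"
MAX_SIZE_BYTES = 4 * 1024 * 1024  # "less than 4 MB" (assumed to mean MiB)


def check_asr_limits(path: str) -> List[str]:
    """Return a list of limit violations for a WAV file (empty if none)."""
    problems: List[str] = []

    size = os.path.getsize(path)
    if size >= MAX_SIZE_BYTES:
        problems.append(f"file size {size} bytes is not below {MAX_SIZE_BYTES}")

    # Duration = number of frames / frames per second.
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration > MAX_DURATION_SECONDS:
        problems.append(
            f"duration {duration:.1f} s exceeds {MAX_DURATION_SECONDS:.0f} s"
        )

    return problems
```

An empty list means the file is within both limits; otherwise each entry describes one violated limit.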

ASR Advantages

  • High Accuracy

    Employs deep learning technologies to achieve a speech recognition accuracy of over 95%.

  • Strong Language Support

    Supports recognition of Mandarin Chinese speech to meet application requirements in various scenarios.

  • Solid Reliability

    Offers proven stability, built on years of experience in complex enterprise customer scenarios.

  • High Efficiency

    Provides standard RESTful APIs and service SDKs to facilitate use and integration while reducing labor and business costs.

TTS

TTS converts text into lifelike voices. It provides speech services with customizable timbres, volumes, and speeds for enterprises and individuals.

TTS Advantages

  • High Accuracy

    Employs deep learning technologies to rapidly synthesize natural, fluent human speech.

  • Customization

    Allows you to customize the timbre, tone, and speed of spoken text based on your needs.

  • Solid Reliability

    Offers proven stability, built on years of experience in complex enterprise customer scenarios.

  • High Efficiency

    Provides standard RESTful APIs and service SDKs to facilitate use and integration while reducing labor and business costs.