Updated on 2025-07-02 GMT+08:00

Audio Dataset Processing Operators

The platform supports the processing of audio datasets. For details about the audio processing operator capabilities, see Table 1.

Table 1 Audio dataset processing operator capabilities

| Category | Operator Name | Operator Description |
| --- | --- | --- |
| Data conversion | Noise Addition | Adds noise to the audio. |
| Data conversion | Noise Suppression | Removes pure-noise segments from the audio and reduces noise. |
| Data conversion | Tone Change | Adjusts the pitch of the original audio. |
| Data conversion | Reverberation Reduction | Reduces room reverberation in the audio and improves speech intelligibility. |
| Data conversion | Voice Anonymization | Anonymizes the audio. The anonymized audio differs greatly from the original in speaker timbre and voiceprint. |
| Data conversion | Voice Noise Reduction | Reduces the noise in the original audio. Only noise that overlaps with human voice is handled. No restriction is imposed on pure noise audio or pure noise segments. |
| Data conversion | Speaking Speed Adjustment | Adjusts the speaking speed in the audio. |
| Data conversion | Voice Style Conversion | Converts the original audio based on the specified target style. |
| Data conversion | Audio Quantization Encoding | Converts a high-resolution audio file with header information into a 16 kHz alaw/ulaw/pcm/wav file using audio encoding and decoding technologies and quantization compression technologies. |
| Data labeling | Speech Language Recognition and Labeling | Identifies the language used by the speaker in the audio and provides the confidence. |
| Data labeling | Speech-to-Text Conversion (Mandarin) | Converts Mandarin speech into text quickly to enrich human-machine interaction scenarios. |
| Data labeling | Speech Sentiment Recognition and Labeling | Recognizes the sentiments of speakers in the input audio. |
| Data labeling | Voice Activity Detection | Detects the start and end time of each segment of human voice in the audio. |
| Data labeling | Noise Level Evaluation | Scores the quality of the audio containing human voice segments. |
| Data labeling | Silent Segment Detection | Identifies silent segments in the audio, reports the confidence, and provides the proportion of silent segments. |
| Data labeling | Multi-speaker Speech Recognition | Identifies the audio content and returns the start time, end time, and content of each speaker's speech. |
| Data labeling | Personal Privacy Dialog Identification | Labels personal privacy speech content. |
| Data labeling | Prohibited Speech Detection | Labels prohibited speech content. |
| Data labeling | Political Sensitive Speech Content Recognition Operator | Labels politically sensitive speech content. |
| Data labeling | Pornographic Speech Content Detection | Labels pornographic speech content. |

Noise Addition

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz)
  • Operator description: Adds noise to the audio.
  • Parameter description

    Noise type: type of the noise to be added. The mixed noise is the superposition of Gaussian noise and salt-and-pepper noise.

    Signal-to-noise ratio (SNR): ratio of the normal sound signal strength to the noise signal strength.
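
The relationship between the SNR parameter and the amount of noise added can be illustrated with a minimal sketch. It is not the platform's implementation. It assumes the SNR is expressed in decibels (the source does not state the unit), that samples are floats in the range -1 to 1, and that salt-and-pepper noise means sparse impulses of +1 or -1. The function names and the `density` value are hypothetical.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of float samples."""
    return sum(s * s for s in samples) / len(samples)


def gaussian_noise(n, rng):
    """Zero-mean, unit-variance Gaussian noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def salt_and_pepper_noise(n, rng, density=0.01):
    """Impulse noise: a small fraction of samples are +1 or -1, the rest 0.

    The density value is illustrative, not a platform default.
    """
    return [rng.choice((-1.0, 1.0)) if rng.random() < density else 0.0
            for _ in range(n)]


def add_noise(signal, noise, snr_db):
    """Scale `noise` so signal power / noise power matches snr_db, then mix."""
    p_signal = signal_power(signal)
    p_noise = signal_power(noise)
    if p_signal == 0 or p_noise == 0:
        raise ValueError("signal and noise must both be non-silent")
    target_noise_power = p_signal / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / p_noise)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR of `noisy` relative to `clean`, in dB."""
    residual = [b - a for a, b in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(residual))


rng = random.Random(0)
sample_rate = 16000
# One second of a 440 Hz tone stands in for real speech.
clean = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
         for t in range(sample_rate)]
# Mixed noise: superposition of Gaussian and salt-and-pepper noise.
mixed = [g + i for g, i in zip(gaussian_noise(len(clean), rng),
                               salt_and_pepper_noise(len(clean), rng))]
noisy = add_noise(clean, mixed, snr_db=10)
```

Under these assumptions, a lower SNR value yields a noisier result, because the noise is scaled up relative to the signal.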

Noise Suppression

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 30s; sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
  • Operator description: Removes pure-noise segments from the audio and reduces noise.
  • Parameter configuration example

    No parameters need to be set.
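
Because several operators accept only WAV files that meet specific limits, it can help to check files before uploading them. The following is a minimal sketch using Python's standard `wave` module. The helper names `check_wav` and `make_wav` are hypothetical, and the defaults mirror the Noise Suppression limits listed above (≤ 30 s, 16 kHz, 16 bits, single-channel). Pass other values to check the limits of a different operator.

```python
import io
import wave


def check_wav(source, max_seconds=30, rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of constraint violations (empty if the file qualifies).

    `source` is a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sampling rate is {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getsampwidth() != sample_width_bytes:
            problems.append(
                f"bit depth is {8 * wav.getsampwidth()} bits, "
                f"expected {8 * sample_width_bytes} bits")
        if wav.getnchannels() != channels:
            problems.append(
                f"file has {wav.getnchannels()} channels, expected {channels}")
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(
                f"duration is {duration:.1f} s, limit is {max_seconds} s")
    return problems


def make_wav(seconds, rate=16000, channels=1):
    """Build an in-memory 16-bit silent WAV file for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    buffer.seek(0)
    return buffer


for problem in check_wav(make_wav(31)):
    print(problem)
```

The check reads only the WAV header fields, so it does not confirm that the file contains speech or that the platform will accept it.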

Tone Change

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Adjusts the pitch of the original audio.
  • Parameter description

    Tone: amount by which the pitch of the audio is adjusted

Reverberation Reduction

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz)
  • Operator description: Reduces room reverberation in the audio and improves speech intelligibility.
  • Parameter configuration example

    No parameters need to be set.

Voice Anonymization

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 30s; sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
  • Operator description: Anonymizes the audio. The anonymized audio differs greatly from the original in speaker timbre and voiceprint.
  • Parameter configuration example

    No parameters need to be set.

Voice Noise Reduction

  • Applicable file format: pure audio file in WAV format (sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
  • Operator description: Reduces the noise in the original audio. Only noise that overlaps with human voice is handled. No restriction is imposed on pure noise audio or pure noise segments.
  • Parameter configuration example

    No parameters need to be set.

Speaking Speed Adjustment

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s)
  • Operator description: Adjusts the speaking speed in the audio.
  • Parameter description

    Speaking speed: The value ranges from 0.5 to 2.
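
As a rough guide to the effect of this parameter, the sketch below estimates the output duration for a given setting. It assumes the value is a multiplicative rate factor, where 1 leaves the audio unchanged and 2 doubles the speed. The source gives only the range, so this interpretation and the function name `adjusted_duration` are assumptions.

```python
def adjusted_duration(duration_s, speed):
    """Estimate the output duration in seconds for a speaking-speed factor.

    Assumes `speed` is a rate multiplier (an interpretation, not confirmed
    by the operator description). Rejects values outside 0.5 to 2.
    """
    if not 0.5 <= speed <= 2:
        raise ValueError("speaking speed must be between 0.5 and 2")
    return duration_s / speed


for speed in (0.5, 1, 2):
    print(speed, adjusted_duration(60, speed))
```

Under this assumption, a 60 s clip lasts 120 s at a setting of 0.5 and 30 s at a setting of 2.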

Voice Style Conversion

  • Applicable file format: pure audio file (file size ≤ 50 MB)
  • Operator description: Converts the original audio based on the specified target style.
  • Parameter description

    Voice style: voice style after conversion

Audio Quantization Encoding

  • Applicable file format: pure audio file (file size ≤ 100 MB)
  • Operator description: Converts a high-resolution audio file with header information into a 16 kHz alaw/ulaw/pcm/wav file using audio encoding and decoding technologies and quantization compression technologies.
  • Parameter configuration example

    No parameters need to be set.
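
To give a sense of what quantization compression involves, the following sketch applies the continuous μ-law companding formula with 8-bit quantization. It is illustrative only: the operator's actual codec is not documented here, production ulaw encoders typically follow the segmented approximation defined in ITU-T G.711, and resampling to 16 kHz is not shown. The function names are hypothetical.

```python
import math

MU = 255  # companding constant commonly used for 8-bit mu-law


def mulaw_encode(x):
    """Compress a float sample in [-1, 1] to an 8-bit code (0..255)."""
    x = max(-1.0, min(1.0, x))
    # Logarithmic compression keeps more resolution for quiet samples.
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))


def mulaw_decode(code):
    """Expand an 8-bit code back to a float sample in [-1, 1]."""
    compressed = 2 * code / MU - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed)


for sample in (-1.0, -0.01, 0.0, 0.01, 0.5, 1.0):
    code = mulaw_encode(sample)
    print(sample, code, round(mulaw_decode(code), 4))
```

The round trip is lossy: quiet samples are reproduced with small absolute error, while loud samples tolerate larger error, which is the trade-off that lets each sample fit in 8 bits.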

Speech Language Recognition and Labeling

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
  • Operator description: Identifies the language used by the speaker in the audio and provides the confidence.
  • Parameter configuration example

    No parameters need to be set.

Speech-to-Text Conversion (Mandarin)

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Converts Mandarin speech into text quickly to enrich human-machine interaction scenarios.
  • Parameter description

    Punctuation: whether to add punctuation marks to the recognition result

    Digit conversion: whether to recognize numbers in speech as Arabic numerals

    Word segmentation information: whether the recognition result contains the word segmentation result

Speech Sentiment Recognition and Labeling

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
  • Operator description: Recognizes the sentiments of speakers in the input audio.
  • Parameter configuration example

    No parameters need to be set.

Voice Activity Detection

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 600s; sampling rate: 16 kHz; bit depth: 16 bits)
  • Operator description: Detects the start and end time of each segment of human voice in the audio.
  • Parameter configuration example

    No parameters need to be set.
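
The kind of output this operator produces (start and end times of voiced segments) can be illustrated with a simple energy-based detector. This sketch is not the platform's algorithm: it flags any frame whose RMS level exceeds a fixed threshold, so it cannot tell human voice from other loud sounds. The function name, frame length, and threshold are assumptions chosen for the example.

```python
import math


def detect_voice_segments(samples, sample_rate=16000, frame_ms=20, threshold=0.02):
    """Return (start_s, end_s) pairs for runs of frames whose RMS exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # start time of the segment currently open, if any
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t = i * frame_len / sample_rate
        if rms > threshold and start is None:
            start = t
        elif rms <= threshold and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


rate = 16000
silence = [0.0] * rate
# A one-second tone stands in for speech in this synthetic example.
tone = [0.3 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
print(detect_voice_segments(silence + tone + silence))  # prints [(1.0, 2.0)]
```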

Noise Level Evaluation

  • Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
  • Operator description: Scores the quality of the audio containing human voice segments.
  • Parameter configuration example

    No parameters need to be set.

Silent Segment Detection

  • Applicable file format: pure audio file (audio duration ≤ 600s; sampling rate: 16 kHz; bit depth: 16 bits)
  • Operator description: Identifies silent segments in the audio, reports the confidence, and provides the proportion of silent segments.
  • Parameter configuration example

    No parameters need to be set.
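
The proportion of silent segments can be understood as the total silent time divided by the audio duration. The sketch below computes it from a list of segments. The `(start, end)` tuple layout in seconds and the function name are assumptions, since the operator's output format is not described here, and the segments are assumed not to overlap.

```python
def silent_proportion(silent_segments, total_duration_s):
    """Fraction of the audio covered by silent segments.

    `silent_segments` is a list of non-overlapping (start_s, end_s) pairs.
    """
    if total_duration_s <= 0:
        raise ValueError("total duration must be positive")
    silent_time = sum(end - start for start, end in silent_segments)
    return silent_time / total_duration_s


# Two silent stretches (1.5 s and 2.0 s) in a 10 s clip.
print(silent_proportion([(0.0, 1.5), (4.0, 6.0)], 10.0))
```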

Multi-speaker Speech Recognition

  • Applicable file format: pure audio file (audio duration ≤ 1 hour; single-channel)
  • Operator description: Identifies the audio content and returns the start time, end time, and content of each speaker's speech.
  • Parameter description

    Punctuation: whether to add punctuation marks to the recognition result

    Digit conversion: whether to recognize numbers in speech as Arabic numerals

    Word segmentation information: whether the recognition result contains the word segmentation result

    Speaker separation: whether the recognition result contains speaker information

    Speaking speed: whether the recognition result contains the speaking speed

Personal Privacy Dialog Identification

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Labels personal privacy speech content.
  • Parameter configuration example

    No parameters need to be set.

Prohibited Speech Detection

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Labels prohibited speech content.
  • Parameter configuration example

    No parameters need to be set.

Political Sensitive Speech Content Recognition Operator

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Labels politically sensitive speech content.
  • Parameter configuration example

    No parameters need to be set.

Pornographic Speech Content Moderation Operator

  • Applicable file format: pure audio file (audio duration ≤ 60s)
  • Operator description: Labels pornographic speech content.
  • Parameter configuration example

    No parameters need to be set.