Audio Dataset Processing Operators

The platform supports the processing of audio datasets. For details about the audio processing operator capabilities, see Table 1.

**Table 1** Audio dataset processing operator capabilities
Category	Operator Name	Operator Description
Data conversion	Noise Addition	Adds noise to the audio.
	Noise Suppression	Removes pure noise segments from the audio and reduces noise.
	Tone Change	Adjusts the pitch of the original audio.
	Reverberation Reduction	Reduces room reverberation in the audio and improves the intelligibility of the voice.
	Voice Anonymization	Anonymizes the audio. The anonymized audio differs greatly from the original in speaker timbre and voiceprint.
	Voice Noise Reduction	Reduces the noise in the original audio. Only noise that overlaps with human voice is handled. Pure noise audio and pure noise segments are not processed.
	Speaking Speed Adjustment	Adjusts the speaking speed in the audio.
	Voice Style Conversion	Converts the original audio based on the specified target style.
	Audio Quantization Encoding	Converts a high-resolution audio file with header information into a 16 kHz A-law, μ-law, PCM, or WAV file using audio codec and quantization compression technologies.
Data labeling	Speech Language Recognition and Labeling	Identifies the language used by the speaker in the audio and provides the confidence.
	Speech-to-Text Conversion (Mandarin)	Converts Mandarin speech into text quickly to enrich human-machine interaction scenarios.
	Speech Sentiment Recognition and Labeling	Recognizes the sentiments of speakers in the input audio.
	Voice Activity Detection	Detects the start and end time of each segment of human voice in the audio.
	Noise Level Evaluation	Scores the quality of the audio containing human voice segments.
	Silent Segment Detection	Identifies silent segments in the audio with a confidence for each, and provides the proportion of silent segments.
	Multi-speaker Speech Recognition	Identifies the audio content, and returns the start time, end time, and content of each speaker's speech.
	Personal Privacy Dialog Identification	Labels personal privacy voice content.
	Prohibited Speech Detection	Labels prohibited speech.
	Politically Sensitive Speech Content Recognition	Labels politically sensitive speech.
	Pornographic Speech Content Detection	Labels pornographic content.

Noise Addition

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz)
Operator description: Adds noise to the audio.
Parameter description:
Noise type: type of noise to add. The mixed noise type superimposes Gaussian noise and salt-and-pepper noise.

Signal-to-noise ratio (SNR): ratio of the strength of the original audio signal to the strength of the added noise.
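
The relationship between the SNR and the amount of noise added can be sketched as follows. This is a minimal illustration, not the operator's implementation: it assumes the SNR is expressed in decibels (the description above does not state a unit), covers Gaussian noise only, and uses hypothetical function names.

```python
import math
import random


def add_gaussian_noise(samples, snr_db, seed=None):
    """Return a copy of `samples` with white Gaussian noise at the given SNR (dB)."""
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(P_signal / P_noise), so P_noise = P_signal / 10^(SNR/10).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone sampled at 16 kHz (assumed test signal).
clean = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_gaussian_noise(clean, snr_db=10, seed=0)
```

A lower SNR value means more noise relative to the signal, so the noisy output deviates more from the original.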

Noise Suppression

Applicable file format: pure audio file in WAV format (audio duration ≤ 30s; sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
Operator description: Removes pure noise segments from the audio and reduces noise.
Parameter configuration example
No parameters need to be set.
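
Before a dataset is submitted, files can be checked locally against the format limits listed for this operator. The sketch below uses Python's standard `wave` module. The function name `check_wav` and its defaults are assumptions chosen to mirror the limits above (≤ 30s, 16 kHz, 16 bits, single-channel), and it is not part of the platform.

```python
import wave


def check_wav(path, max_seconds=30.0, rate=16000, sample_width_bytes=2, channels=1):
    """Return a list of problems; an empty list means the file fits the limits."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != rate:
            problems.append(f"sampling rate is {wf.getframerate()} Hz, expected {rate} Hz")
        if wf.getsampwidth() != sample_width_bytes:
            problems.append(
                f"bit depth is {8 * wf.getsampwidth()} bits, expected {8 * sample_width_bytes} bits"
            )
        if wf.getnchannels() != channels:
            problems.append(f"{wf.getnchannels()} channels found, expected {channels}")
        # Duration in seconds = number of frames / frames per second.
        duration = wf.getnframes() / wf.getframerate()
        if duration > max_seconds:
            problems.append(f"duration is {duration:.1f}s, limit is {max_seconds}s")
    return problems
```

The same helper can be reused for other operators by passing their limits, for example `max_seconds=60.0` for operators that accept up to 60s of audio.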

Tone Change

Applicable file format: pure audio file (audio duration ≤ 60s)
Operator description: Adjusts the pitch of the original audio.
Parameter description
Tone: pitch adjustment to apply to the audio

Reverberation Reduction

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz)
Operator description: Reduces room reverberation in the audio and improves the intelligibility of the voice.
Parameter configuration example
No parameters need to be set.

Voice Anonymization

Applicable file format: pure audio file in WAV format (audio duration ≤ 30s; sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
Operator description: Anonymizes the audio. The anonymized audio differs greatly from the original in speaker timbre and voiceprint.
Parameter configuration example
No parameters need to be set.

Voice Noise Reduction

Applicable file format: pure audio file in WAV format (sampling rate: 16 kHz; bit depth: 16 bits; single-channel)
Operator description: Reduces the noise in the original audio. Only noise that overlaps with human voice is handled. Pure noise audio and pure noise segments are not processed.
Parameter configuration example
No parameters need to be set.

Speaking Speed Adjustment

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s)
Operator description: Adjusts the speaking speed in the audio.
Parameter description
Speaking speed: The value ranges from 0.5 to 2.
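
A small range check for this parameter might look like the sketch below. It assumes the value is a rate multiplier where 1 leaves the audio unchanged, which the description above does not state explicitly. The function name is hypothetical.

```python
def adjusted_duration(duration_s, speed):
    """Return the expected duration after a speed change, or raise if out of range."""
    # The documented range for the speaking speed parameter is 0.5 to 2.
    if not 0.5 <= speed <= 2:
        raise ValueError(f"speaking speed {speed} is outside the range 0.5 to 2")
    # Assumption: speed is a rate multiplier, so 2 halves the duration.
    return duration_s / speed
```

Under that assumption, slowing a 30s clip to 0.5 yields about 60s of audio.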

Voice Style Conversion

Applicable file format: pure audio file (file size ≤ 50 MB)
Operator description: Converts the original audio based on the specified target style.
Parameter description
Voice style: target voice style for the converted audio

Audio Quantization Encoding

Applicable file format: pure audio file (file size ≤ 100 MB)
Operator description: Converts a high-resolution audio file with header information into a 16 kHz A-law, μ-law, PCM, or WAV file using audio codec and quantization compression technologies.
Parameter configuration example
No parameters need to be set.
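
To illustrate what quantization compression means for one of the listed targets, the sketch below applies the continuous μ-law companding formula and an 8-bit quantizer to a single sample. It is not the operator's implementation: production G.711 codecs use a piecewise-linear approximation of this curve, and all function names here are hypothetical.

```python
import math

MU = 255  # companding constant used for 8-bit mu-law


def mulaw_compress(x, mu=MU):
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Invert mulaw_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)


def quantize_8bit(y):
    """Quantize a companded value in [-1, 1] to an integer code in 0..255."""
    return min(255, int((y + 1) / 2 * 256))


def dequantize_8bit(code):
    """Map an integer code back to the midpoint of its quantization interval."""
    return (code + 0.5) / 256 * 2 - 1


def roundtrip(x):
    """Compress, quantize, dequantize, and expand one sample."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))
```

Because the curve is logarithmic, quiet samples get finer quantization steps than loud ones, which is why 8 bits per sample can remain usable for speech.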

Speech Language Recognition and Labeling

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
Operator description: Identifies the language used by the speaker in the audio and provides the confidence.
Parameter configuration example
No parameters need to be set.

Speech-to-Text Conversion (Mandarin)

Applicable file format: pure audio file (audio duration ≤ 60s)
Operator description: Converts Mandarin speech into text quickly to enrich human-machine interaction scenarios.
Parameter description
Punctuation: whether to add punctuation marks to the recognition result

Digit conversion: whether to recognize numbers in speech as Arabic numerals

Word segmentation information: whether the recognition result contains the word segmentation result

Speech Sentiment Recognition and Labeling

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
Operator description: Recognizes the sentiments of speakers in the input audio.
Parameter configuration example
No parameters need to be set.

Voice Activity Detection

Applicable file format: pure audio file in WAV format (audio duration ≤ 600s; sampling rate: 16 kHz; bit depth: 16 bits)
Operator description: Detects the start and end time of each segment of human voice in the audio.
Parameter configuration example
No parameters need to be set.
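
The shape of the output (a start and end time per segment) can be illustrated with a simple energy-threshold detector. This is only a sketch: it flags any frame with enough energy rather than human voice specifically, the frame length and threshold are assumed values, and the operator's actual detection method is not described here.

```python
import math


def detect_voice_segments(samples, rate=16000, frame_ms=20, threshold=0.02):
    """Return (start_s, end_s) pairs for runs of frames whose RMS exceeds threshold."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # start time of the segment currently open, if any
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        t = i * frame_len / rate
        if rms > threshold and start is None:
            start = t  # an active run begins at this frame
        elif rms <= threshold and start is not None:
            segments.append((start, t))  # the run ended at this frame
            start = None
    if start is not None:
        # The audio ended while a run was still open.
        segments.append((start, n_frames * frame_len / rate))
    return segments


# 0.5 s of silence, 1 s of a 440 Hz tone, then 0.5 s of silence at 16 kHz.
silence = [0.0] * 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
segments = detect_voice_segments(silence + tone + silence)
print(segments)  # [(0.5, 1.5)]
```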

Noise Level Evaluation

Applicable file format: pure audio file in WAV format (audio duration ≤ 60s; sampling rate: 16 kHz; bit depth: 16 bits)
Operator description: Scores the quality of the audio containing human voice segments.
Parameter configuration example
No parameters need to be set.

Silent Segment Detection

Applicable file format: pure audio file (audio duration ≤ 600s; sampling rate: 16 kHz; bit depth: 16 bits)
Operator description: Identifies silent segments in the audio with a confidence for each, and provides the proportion of silent segments.
Parameter configuration example
No parameters need to be set.

Multi-speaker Speech Recognition

Applicable file format: pure audio file (audio duration ≤ 1 hour; single-channel)
Operator description: Identifies the audio content, and returns the start time, end time, and content of each speaker's speech.
Parameter description
Punctuation: whether to add punctuation marks to the recognition result

Digit conversion: whether to recognize numbers in speech as Arabic numerals

Word segmentation information: whether the recognition result contains the word segmentation result

Speaker separation: whether the recognition result contains speaker information

Speaking speed: whether the recognition result contains the speaking speed