Updated on 2024-11-27 GMT+08:00

Recording a Human Audio

You can upload a recording of a human voice to MetaStudio for AI training. The result is a voice model that faithfully (1:1) reproduces the speaker's timbre.

The voice model can be used for text-to-speech conversion and applied to scenarios such as virtual avatar video production, livestreaming, and intelligent interaction. The recording requirements are as follows:

  • Advanced edition: a single WAV audio file of 10–30 minutes (recommended: 15 minutes) that contains all 100 phrases
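Before uploading, you may want to confirm that a recording falls inside the 10–30 minute window. The following is a minimal sketch that assumes an uncompressed PCM WAV file and uses only Python's standard `wave` module. The function names `wav_duration_seconds` and `duration_ok` are illustrative and are not part of MetaStudio.

```python
import wave

MIN_SECONDS = 10 * 60  # lower bound of the stated 10-30 minute range
MAX_SECONDS = 30 * 60  # upper bound of the stated 10-30 minute range


def wav_duration_seconds(path):
    """Return the playback length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_ok(seconds):
    """Return True if the duration lies within 10 to 30 minutes."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

For example, `duration_ok(wav_duration_seconds("Voice.wav"))` reports whether a local file named Voice.wav meets the length requirement.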

Preparing for Recording

Table 1 Recording preparations

Edition: Advanced

Recording device and software: Professional recording devices and software (recommended software: Adobe Audition) are preferred for audio recording. If professional recording devices are not available, you can use your mobile phone for recording. See Recording an Audio on a Mobile Phone.

Recording environment:

  • Record the audio in a quiet environment free of echo, reverberation, and noise such as car horns, talking, or footsteps.
  • You can use a decibel (dB) meter app to test the background noise in the recording environment. The background noise should be lower than 0 dB.
  • Do not change the recording device or environment during the same recording task.

Recording script: You are advised to use Script Examples (Advanced Edition). You can also customize the script, in which case the length of each phrase must match that of the phrases in the example. Improvised recording is not recommended because too many filler words may compromise speech coherence.

Starting Recording

The recorded audio must be high-quality, free of noise and background sounds, and spoken by a single person. You can use an iPhone or Android mobile phone to record the audio. See Recording an Audio on a Mobile Phone.

Table 2 describes the precautions for recording.

Table 2 Recording precautions

Distance from the microphone: Keep about one fist's distance from the microphone. Do not get too close, or the recording may pick up plosive pops and breath sounds.

Recording content: Do not read the number at the start of each script item. For example, for the script "4. It features a multitude of functions and superior performance", do not read "4".

Audio format: Save the audio file in a lossless format, for example, WAV. The recording data should be saved without compression encoding (sample rate: 48 kHz; bit depth: 16 bits; channels: mono).

Speech style: Keep the speech style consistent throughout the recording and avoid excessive emotion.

Pronunciation: Pronounce clearly and accurately at a moderate volume. If an unwanted sound occurs, record the phrase again.

Speed and rhythm: Speak at a natural, steady pace, neither too fast nor too slow.

Volume: The volume must not be too low, too high, or fluctuating. Clipping is not allowed.

Pause: Pause naturally and breathe softly at punctuation marks and other appropriate positions. In a long audio file, leave a pause of 2–3 seconds between phrases.

Accent position: Place the accent (stress) correctly to avoid misplaced emphasis.

Reading accuracy: Read the script in order and exactly as written (do not skip or add words), and avoid mispronunciation. If you misread a phrase or stumble, record the whole phrase again.
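The audio format and clipping precautions can be checked locally before submission. The sketch below assumes Python's standard `wave` and `array` modules. The function name `check_recording` and the rule "any sample at full scale counts as possible clipping" are illustrative assumptions, not MetaStudio's own validation logic.

```python
import array
import sys
import wave

EXPECTED_RATE = 48000   # 48 kHz sample rate
EXPECTED_WIDTH = 2      # 16 bits = 2 bytes per sample
EXPECTED_CHANNELS = 1   # mono


def check_recording(path):
    """Return a list of human-readable problems found in a WAV file."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            channels = wav.getnchannels()
            frames = wav.readframes(wav.getnframes())
    except wave.Error as exc:
        # The wave module reads only uncompressed PCM data.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if width != EXPECTED_WIDTH:
        problems.append(f"sample width is {width * 8} bits, expected 16 bits")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels found, expected mono")

    if width == 2:
        samples = array.array("h")  # signed 16-bit integers
        samples.frombytes(frames)
        if sys.byteorder == "big":
            samples.byteswap()      # WAV sample data is little-endian
        # Assumption: samples sitting at full scale hint at clipping.
        clipped = samples.count(32767) + samples.count(-32768)
        if clipped:
            problems.append(f"{clipped} samples at full scale (possible clipping)")
    return problems
```

An empty list means that none of these checks failed. It does not guarantee that the recording is free of noise or otherwise suitable for training.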

Submitting an Audio File

Table 3 Recording submission

Edition: Advanced

Audio description: Record all phrases in one WAV audio file, with a pause of 2 to 3 seconds between phrases. You can upload the WAV file to the MetaStudio console directly. You do not need to compress it or provide a TXT script file. The preset script is recommended, but you can also customize the script. The audio is automatically split at the pauses, and the text of each phrase is recognized.

Audio naming: You can customize the audio file name, for example, Voice.wav.
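Because the phrases are separated by pauses, it can help to check locally that a recording contains the expected number of phrases. The sketch below is one simple way to do that and is not the algorithm MetaStudio uses. It assumes 16-bit mono PCM input, and the names `load_mono_16bit` and `count_phrases`, the 50 ms window, and the silence threshold of 500 are all illustrative assumptions that you may need to tune for your recording.

```python
import array
import sys
import wave


def load_mono_16bit(path):
    """Read a 16-bit mono PCM WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples, rate


def count_phrases(samples, rate, silence_peak=500, min_pause=2.0, window=0.05):
    """Count stretches of speech separated by pauses of at least min_pause seconds.

    A window is treated as silent when its peak absolute amplitude does not
    exceed silence_peak. Shorter pauses do not end a phrase.
    """
    win = max(1, int(rate * window))
    phrases = 0
    in_phrase = False
    quiet_samples = 0
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        if max(abs(s) for s in chunk) > silence_peak:
            if not in_phrase:
                phrases += 1
                in_phrase = True
            quiet_samples = 0
        else:
            quiet_samples += len(chunk)
            if quiet_samples >= min_pause * rate:
                in_phrase = False
    return phrases
```

For example, `count_phrases(*load_mono_16bit("Voice.wav"))` should return a number close to the number of phrases in the script if the 2 to 3 second pauses were observed.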

Creating a Voice Model

After the audio file is ready, upload it to the MetaStudio console for voice training.

Producing an advanced edition voice takes about one to three working days.

Application scenarios of a customized voice:

  • After a customized voice is generated, it is automatically displayed in the voice list on the MetaStudio console. This voice can be used in scenarios such as virtual avatar video production and livestreaming.
  • A customized voice can be called using the APIs of MetaStudio.