Updated on 2024-05-15 GMT+08:00

SSML Definition of Text Control

Text control on MetaStudio uses Speech Synthesis Markup Language (SSML) to control the behavior of virtual humans, including their actions and emotions, as well as the pronunciation of words with multiple readings and the pauses in text-to-speech (TTS) synthesis.

For the basic definition of SSML, see Speech Synthesis Markup Language (SSML) Version 1.0. MetaStudio extends this definition with additional fields for controlling virtual humans.

MetaStudio SSML currently supports the following capabilities:

  • Text pronunciation control during TTS voice synthesis
    The following tags are included:
    • <speak></speak> is the root node of the SSML text.
    • <break/> inserts a silent pause. You can set the pause duration.
    • <phoneme></phoneme> specifies the pronunciation of a word that has more than one pronunciation.
    • <say-as></say-as> specifies how digits or English letters are read.
    • <sub></sub> sets an alias for the marked text, that is, an alternative reading.
    • <prosody></prosody> controls the speaking speed of the enclosed text.

    MetaStudio provides multiple TTS voices (timbres), and each one supports a different set of SSML tags. To find out which tags a voice supports, call the API for querying asset details.

speak

  • Description

    <speak></speak>: Root node of the SSML text.

  • Syntax
    <speak>Text that contains SSML tags</speak>
    
  • Property

    None

  • Tag relationship

    <speak> can contain text and tags, including <break>, <phoneme>, <say-as>, and <sub>.
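
The nesting rules on this page can be checked before a request is sent. The following Python sketch is illustrative only: `check_ssml` is a hypothetical helper, not part of MetaStudio, and it assumes that `<prosody>` may appear under `<speak>` in the same way as the tags listed above.

```python
import xml.etree.ElementTree as ET

# Tags this page describes as usable inside <speak>. Including "prosody"
# here is an assumption; the tag relationship above does not list it.
ALLOWED_CHILDREN = {"break", "phoneme", "say-as", "sub", "prosody"}

def check_ssml(ssml: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != "speak":
        problems.append(f"root element is <{root.tag}>, expected <speak>")
    for child in root:
        if child.tag not in ALLOWED_CHILDREN:
            problems.append(f"unexpected tag <{child.tag}> inside <speak>")
        elif len(child) > 0:
            # Per the tag relationships on this page, child tags hold text only.
            problems.append(f"<{child.tag}> must not contain other tags")
    return problems

print(check_ssml('<speak>One sentence<break time="200ms"/>another sentence</speak>'))
```

This is a local sanity check only. The service remains the authority on what a given voice accepts.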

break

  • Description

    <break/>: Inserts a silent pause at any position in the text.

  • Syntax
    <break time="String"/>
    
  • Property
    Table 1 Property description

    | Property Name | Property Type | Property Value | Mandatory (Yes/No) | Description |
    |---|---|---|---|---|
    | time | String | Value range: 200 ms to 10 s | No | Duration of the silent pause, in milliseconds (for example, 200ms). |
    | strength | String | none: no pause; x-weak: very short pause; weak: short pause; medium: medium pause; strong: long pause; x-strong: very long pause | No | Strength of the pause (prosodic break). |

  • Tag relationship

    This tag cannot contain any other tag.

  • Example value
    One sentence<break time="200ms"/>another sentence
    One sentence<break strength="strong"/>another sentence
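
When pauses are generated by a program, the documented limits can be enforced before the tag is emitted. The sketch below is a minimal illustration: `break_tag` and `break_strength` are hypothetical helper names, and it assumes that a 10 s pause may be written in milliseconds as `10000ms`, which this page does not confirm.

```python
# Strength values listed in Table 1.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")

def break_tag(milliseconds: int) -> str:
    """Build a <break/> tag with a duration inside the documented 200 ms to 10 s range."""
    if not 200 <= milliseconds <= 10_000:
        raise ValueError("pause must be between 200 ms and 10 s")
    # Assumption: durations are always written with the "ms" suffix,
    # as in the 200ms example on this page.
    return f'<break time="{milliseconds}ms"/>'

def break_strength(strength: str) -> str:
    """Build a <break/> tag that uses one of the documented strength values."""
    if strength not in STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'

print("One sentence" + break_tag(200) + "another sentence")
print("One sentence" + break_strength("strong") + "another sentence")
```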
    

phoneme

  • Description

    <phoneme></phoneme>: Specifies the pronunciation of a Chinese or English word that has more than one pronunciation.

  • Syntax
    <phoneme ph="string">Text</phoneme>
    The <phoneme ph="W EH1 DH AH0">weather</phoneme> is very good.
    
  • Property
    Table 2 Property description

    | Property Name | Property Type | Property Value | Mandatory (Yes/No) | Description |
    |---|---|---|---|---|
    | ph | String | Pinyin or phoneme | Yes | For Chinese, enter Pinyin with the tone written as 1, 2, 3, or 4; the value 5 indicates no tone. For English, enter phonemes as defined in the CMU Pronouncing Dictionary. |

  • Tag relationship

    This tag can contain text but no other tag.

  • Example value
    The <phoneme ph="tian1 qi1">weather</phoneme> is good today.
    

    To obtain the Pinyin for Chinese characters, you can use a JavaScript library. For details, see pinyin-pro.
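
The tone-digit rule for Pinyin can be checked mechanically. The following sketch is an illustration under stated assumptions: `pinyin_phoneme` is a hypothetical helper, and it assumes that syllables are lowercase, separated by spaces, and each end in one tone digit from 1 to 5, as in the example above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# One Pinyin syllable: lowercase letters followed by a tone digit 1-5
# (5 meaning no tone). This pattern is an assumption based on the example.
_SYLLABLE = re.compile(r"[a-zü]+[1-5]")

def pinyin_phoneme(text: str, ph: str) -> str:
    """Wrap text in a <phoneme> tag after checking each Pinyin syllable."""
    syllables = ph.split()
    if not syllables:
        raise ValueError("ph must contain at least one syllable")
    for syllable in syllables:
        if not _SYLLABLE.fullmatch(syllable):
            raise ValueError(f"bad Pinyin syllable: {syllable!r}")
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f"<phoneme ph={quoteattr(ph)}>{escape(text)}</phoneme>"

print(pinyin_phoneme("weather", "tian1 qi1"))
```

English phonemes from the CMU Pronouncing Dictionary follow a different pattern and are not covered by this check.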

say-as

  • Description

    <say-as></say-as>: Reads the marked text as a specific type of content, or spells an English word character by character.

  • Syntax
    <say-as interpret-as="string">Digit or word</say-as>
    
  • Property
    Table 3 Property description

    | Property Name | Property Type | Property Value | Mandatory (Yes/No) | Description |
    |---|---|---|---|---|
    | interpret-as | String | money: money; date: date; figure: value; phone: phone number; english: English word; spell: spelling an English word character by character | Yes | The content is read as the given type. |

  • Tag relationship

    This tag can contain text but no other tag.

  • Example value
    <say-as interpret-as="money">15.55 RMB</say-as>
    <say-as interpret-as="date">2022/3/8</say-as>
    <say-as interpret-as="figure">175 cm</say-as>
    <say-as interpret-as="phone">151 12345678</say-as>
    <say-as interpret-as="english">Hello</say-as>
    <say-as interpret-as="spell">Hello</say-as><!-- Read: H E L L O -->
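
Because interpret-as accepts only the values in Table 3, a small wrapper can reject anything else before the text is submitted. This is a sketch only, and `say_as` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

# Values listed in Table 3.
INTERPRET_AS = {"money", "date", "figure", "phone", "english", "spell"}

def say_as(interpret_as: str, text: str) -> str:
    """Wrap text in a <say-as> tag, rejecting undocumented interpret-as values."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    # escape() protects &, < and > so the result stays well-formed XML.
    return f'<say-as interpret-as="{interpret_as}">{escape(text)}</say-as>'

print(say_as("date", "2022/3/8"))
print(say_as("spell", "Hello"))
```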
    

sub

  • Description

    <sub></sub>: Replaces the marked text with an alias when it is read aloud.

  • Syntax
    <sub alias="string">Text</sub>
    
  • Property
    Table 4 Property description

    | Property Name | Property Type | Property Value | Mandatory (Yes/No) | Description |
    |---|---|---|---|---|
    | alias | String | Alternative word | Yes | This value is read in place of the content of the tag. |

  • Tag relationship

    This tag can contain text but no other tag.

  • Example value
    In this example, the alias "Paul" is what is actually read.
    <sub alias="Paul">Paul</sub> is German.
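
The effect of alias is easier to see when the written and spoken forms differ. The sketch below uses a hypothetical helper, `sub`, and an illustrative abbreviation that does not come from this page.

```python
from xml.sax.saxutils import escape, quoteattr

def sub(text: str, alias: str) -> str:
    """Wrap text in a <sub> tag so that alias is read instead of text."""
    if not alias:
        raise ValueError("alias is mandatory")
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"

# Illustrative values: the written form is "W3C", the spoken form is the full name.
print(sub("W3C", "World Wide Web Consortium"))
```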
    

prosody

  • Description

    <prosody></prosody>: Controls the speaking speed of the enclosed text.

  • Syntax
    <prosody rate="50">Text</prosody>
    
  • Property
    Table 5 Property description

    | Property Name | Property Type | Property Value | Mandatory (Yes/No) | Description |
    |---|---|---|---|---|
    | rate | String | Percentage of the normal speaking speed. The value ranges from 50 to 200. Example: 50, indicating that the reading speed is 0.5 times the normal speed. | Yes | Speaking speed |

  • Tag relationship

    This tag can contain text but no other tag.

  • Example value
    <prosody rate="50">Hello, everyone. I'm a MetaStudio virtual human.</prosody>
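
Since rate is a percentage of the normal speed, a speed multiplier such as 0.5 maps to rate="50". The following sketch shows that conversion with the documented 50 to 200 range enforced. `prosody` is a hypothetical helper, and rounding to a whole number is an assumption.

```python
from xml.sax.saxutils import escape

def prosody(text: str, speed: float) -> str:
    """Wrap text in a <prosody> tag; speed is a multiplier of the normal speed."""
    rate = round(speed * 100)  # 0.5 times normal speed becomes rate="50"
    if not 50 <= rate <= 200:
        raise ValueError("rate must be between 50 and 200")
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'

print(prosody("Hello, everyone. I'm a MetaStudio virtual human.", 0.5))
```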