Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech

Rongjie Huang; Chunlei Zhang; Yi Ren; Zhou Zhao; Dong Yu

Findings: Speech and Multimodality Findings Paper

Session 4: Speech and Multimodality (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Session 4 (15:00-16:30 UTC)

Spotlight Session: Spotlight - Metropolitan Centre (Spotlight)

Conference Room: Metropolitan Centre

Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)

Global Time: July 10, Spotlight Session (23:00-01:00 UTC)

Keywords: spoken language understanding, speech and vision, speech technologies, multimodality

TLDR: Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, which is hampered by dual challenges: 1) prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) highly multimodal prosodic representations canno...

Abstract: Expressive text-to-speech aims to generate high-quality samples with rich and diverse prosody, which is hampered by dual challenges: 1) prosodic attributes in highly dynamic voices are difficult to capture and model without intonation; and 2) highly multimodal prosodic representations cannot be well learned by simple regression (e.g., MSE) objectives, which causes blurry and over-smoothed predictions. This paper proposes Prosody-TTS, a two-stage pipeline that enhances prosody modeling and sampling by introducing several components: 1) a self-supervised masked autoencoder that models the prosodic representation without relying on text transcriptions or local prosody attributes, which allows it to cover diverse speaking voices with superior generalization; and 2) a diffusion model that samples diverse prosodic patterns within the latent space, which prevents TTS models from generating samples with dull prosodic performance. Experimental results show that Prosody-TTS achieves a new state of the art in text-to-speech with natural and expressive synthesis. Both subjective and objective evaluations demonstrate that it exhibits superior audio quality and prosody naturalness with rich and diverse prosodic attributes. Audio samples are available at https://improved_prosody.github.io
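
The second challenge in the abstract (regression objectives over-smoothing a multimodal target) can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a made-up one-dimensional "prosody" value with two modes near -1 and +1, and all names (`make_bimodal_targets`, `distance_to_nearest_mode`, and so on) are hypothetical. It shows that the single prediction minimizing MSE is the mean, which falls between the modes where no data lies, whereas drawing a sample from the data lands on a mode.

```python
import random
import statistics

def make_bimodal_targets(n, seed=0):
    """Toy targets (an assumption, not real prosody data): two tight modes at -1 and +1."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) + rng.gauss(0.0, 0.05) for _ in range(n)]

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

def distance_to_nearest_mode(x):
    """Distance from x to the closer of the two assumed modes."""
    return min(abs(x - 1.0), abs(x + 1.0))

targets = make_bimodal_targets(2000)

# The constant that minimizes MSE is the arithmetic mean of the targets.
mse_optimal = statistics.fmean(targets)

# A sampling-based predictor (here: simply drawing one training target)
# returns a value that sits on one of the modes instead of between them.
sampled = random.Random(1).choice(targets)

print("MSE-optimal prediction:", round(mse_optimal, 3))
print("distance of MSE-optimal prediction to nearest mode:",
      round(distance_to_nearest_mode(mse_optimal), 3))
print("distance of sampled prediction to nearest mode:",
      round(distance_to_nearest_mode(sampled), 3))
```

In this toy setting the regression answer has the lowest MSE yet resembles none of the observed values, which is the "blurry" behavior the abstract attributes to MSE training; a generative sampler such as the diffusion model described above is one way to avoid collapsing to that average.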