Fast speech style adaptation with adjustable prosody and variable duration

Abstract

To achieve personalized speech synthesis, it is essential to synthesize speech with diverse prosody for any given text. This task presents two key challenges. First, existing methods struggle to extract local prosody information and control phoneme duration at the same time, and they overlook the impact of duration on prosody. Second, current speaker adaptation approaches suffer from slow learning or poor generalization to unseen speakers outside the training set. To address these issues, this paper introduces a novel framework. Our method brings the text-to-speech alignment mechanism into prosody modeling: the durations obtained from the alignment are used to segment the speech and extract local prosodic information, and training the two components jointly simplifies the workflow. The extracted prosodic information then serves as a condition that guides the model to learn the phoneme durations corresponding to different types of prosody. We combine this style control with adapter fine-tuning to quickly synthesize speech in a speaker's style using a small amount of data from speakers unseen during training. Experimental results show that our approach is effective at adjusting prosody and duration and at fast style adaptation, and subjective evaluations of the duration-aware prosody modulation model show a significant improvement.
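As an illustration of the segmentation idea described above, the following is a minimal sketch and not the paper's implementation. It assumes that frame-level prosodic features (for example pitch and energy, which are assumptions here) and per-phoneme durations in frames are already available. The function name `pool_prosody_by_duration` and the choice of mean pooling are hypothetical.

```python
import numpy as np


def pool_prosody_by_duration(frame_features, durations):
    """Average frame-level features within each phoneme segment.

    frame_features: array of shape (T, D), one row per speech frame.
    durations: one integer per phoneme (frames per phoneme), summing to T.
    Returns an array of shape (num_phonemes, D).
    Mean pooling is an assumption; the paper may use a learned encoder.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    durations = np.asarray(durations, dtype=int)
    if durations.sum() != frame_features.shape[0]:
        raise ValueError("durations must sum to the number of frames")

    ends = np.cumsum(durations)   # exclusive end frame of each phoneme
    starts = ends - durations     # inclusive start frame of each phoneme

    pooled = np.zeros((len(durations), frame_features.shape[1]))
    for i, (start, end) in enumerate(zip(starts, ends)):
        if end > start:           # zero-length phonemes keep a zero vector
            pooled[i] = frame_features[start:end].mean(axis=0)
    return pooled
```

Each output row is a phoneme-level (local) prosody vector, which could then serve as the condition for a duration predictor, as the abstract outlines.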
