Paper Title
Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
Authors
Abstract
The growing adoption of digital assistants has made text-to-speech (TTS) synthesis an indispensable feature of modern mobile devices. It is therefore desirable to build systems capable of generating highly intelligible speech in the presence of noise. Past studies have investigated style conversion in TTS synthesis, but the degraded synthesis quality often leads to worse intelligibility. To overcome this limitation, we propose a novel transfer learning approach using Tacotron- and WaveRNN-based TTS synthesis. The proposed system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC), which has been shown to provide high intelligibility gains by redistributing the signal energy in the time-frequency domain. We refer to this extension as the Lombard-SSDRC TTS system. Intelligibility enhancement, as quantified by the Speech Intelligibility in Bits (SIIB-Gauss) measure, shows that the proposed Lombard-SSDRC TTS system achieves a significant relative improvement of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in competing-speaker noise (CSN) over the state-of-the-art TTS approach. An additional subjective evaluation shows that Lombard-SSDRC TTS increases speech intelligibility, with relative improvements of 455% for SSN and 104% for CSN in median keyword correction rate compared to the baseline TTS method.
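The relative improvements quoted in the abstract can exceed 100%, which is easy to misread. The sketch below shows the usual arithmetic, assuming the conventional definition (proposed minus baseline, divided by baseline); the abstract does not state its formula explicitly. The function name `relative_improvement` and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent.

    Assumes the conventional definition (proposed - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical intelligibility scores, chosen only to illustrate the
    # arithmetic; they are NOT results reported in the paper.
    baseline_score = 20.0
    proposed_score = 46.0
    gain = relative_improvement(baseline_score, proposed_score)
    print(f"relative improvement: {gain:.0f}%")
```

Under this definition, a relative improvement of 130% means the proposed score is 2.3 times the baseline score, not 1.3 times.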