SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Paper title

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Authors

Artem Ploujnikov, Mirco Ravanelli

Abstract

End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss (that improves disambiguation), exploits curriculum learning (that gradually switches from word-level to sentence-level G2P), and integrates word embeddings from BERT (for further performance improvement). Moreover, the model inherits the best practices in speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.

Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
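For readers unfamiliar with the metric, the sketch below shows one common way to compute a Phoneme Error Rate: the total edit distance between reference and predicted phoneme sequences, divided by the total number of reference phonemes. The abstract does not spell out the exact scoring procedure, so this Levenshtein-based definition is an assumption. The function names and the ARPAbet-style phoneme labels are hypothetical and chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (lists of phonemes)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """PER in percent over paired reference/hypothesis phoneme sequences.

    Assumed definition: total edits / total reference phonemes * 100.
    """
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference sequences contain no phonemes")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_ref


# Illustrative homograph error: "read" (present tense) predicted as past tense.
refs = [["R", "IY", "D"], ["DH", "AH"]]
hyps = [["R", "EH", "D"], ["DH", "AH"]]
print(phoneme_error_rate(refs, hyps))  # one substitution over five phonemes
```

A single wrong vowel in a homograph such as "read" counts as one substitution, which is why a sentence-level metric can look low even when homograph disambiguation is imperfect.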
