Title
Ultrasound-based Articulatory-to-Acoustic Mapping with WaveGlow Speech Synthesis
Authors
Abstract
For articulatory-to-acoustic mapping with deep neural networks, the spectral and excitation parameters of vocoders have typically been used as the training targets. However, vocoding often results in buzzy and muffled final speech quality. Therefore, in this paper on ultrasound-based articulatory-to-acoustic conversion, we use a flow-based neural vocoder (WaveGlow) pre-trained on a large amount of English and Hungarian speech data. The input of the convolutional neural network consists of ultrasound tongue images. The training target is the 80-dimensional mel-spectrogram, which provides a more detailed spectral representation than the previously used 25-dimensional Mel-Generalized Cepstrum. WaveGlow inference then generates synthesized speech from the output of the ultrasound-to-mel-spectrogram prediction. We compare the proposed WaveGlow-based system with a continuous vocoder, which does not use a strict voiced/unvoiced decision when predicting F0. The results demonstrate that in the articulatory-to-acoustic mapping experiments, the WaveGlow neural vocoder produces significantly more natural synthesized speech than the baseline system. Moreover, an advantage of WaveGlow is that F0 is included in the mel-spectrogram representation, so the excitation does not have to be predicted separately.
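The training target mentioned above is an 80-dimensional mel-spectrogram. As a minimal illustration, the sketch below computes such a target from a waveform in plain numpy; the sample rate, FFT size, hop length, and frequency range are assumptions chosen to resemble a typical WaveGlow-style frontend, not values stated in the paper.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one common convention; an assumption here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame, window, and FFT the signal, then apply the mel filterbank
    win = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # (frames, bins)
    mel = mel_filterbank(sr, n_fft, n_mels) @ mag.T     # (80, frames)
    return np.log(np.clip(mel, 1e-5, None))             # log compression

# One second of noise stands in for real speech
wav = np.random.RandomState(0).randn(22050)
mel = mel_spectrogram(wav)
print(mel.shape)  # (80, 83): 80 mel bands per analysis frame
```

In the paper's pipeline, a matrix of this shape is what the convolutional network is trained to predict from ultrasound tongue images, and WaveGlow inverts it back to a waveform.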