Paper title
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
Paper authors
Paper abstract
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly used as an intermediate representation, and the need for mel-spectrogram vocoders is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows redundant processes to be skipped during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). However, the approach solves all problems in a black box and cannot effectively exploit the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, thereby reducing the computational cost of black-box modeling and avoiding redundant estimations of high-dimensional spectrograms. In our experiments, we applied our ideas to three HiFi-GAN variants and made the models faster and more lightweight while retaining reasonable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.
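The trade-off described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The abstract gives no concrete settings, so the FFT size of 1024, hop length of 256, and two 8x upsampling layers are assumptions, and the function names are hypothetical. The sketch assumes the iSTFT parameters shrink by the total temporal upsampling factor, so that each frame handed to the iSTFT has 9 frequency bins instead of 513. Synthetic magnitude and phase values stand in for the network output.

```python
import cmath
import math


def istft_params(fft_size, hop, win, upsample_scales):
    """Scale the STFT parameters down by the total temporal upsampling factor."""
    s = math.prod(upsample_scales)
    if fft_size % s or hop % s or win % s:
        raise ValueError("STFT parameters must be divisible by the upsampling factor")
    return fft_size // s, hop // s, win // s


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def irfft(half, n):
    """Inverse real DFT: n // 2 + 1 complex bins to n real samples."""
    # Rebuild the negative-frequency bins by conjugate symmetry.
    full = list(half) + [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def istft(magnitude, phase, fft_size, hop):
    """Overlap-add iSTFT; each frame holds fft_size // 2 + 1 bins."""
    window = hann(fft_size)
    length = fft_size + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for i, (mag, ph) in enumerate(zip(magnitude, phase)):
        spec = [m * cmath.exp(1j * p) for m, p in zip(mag, ph)]
        frame = irfft(spec, fft_size)
        for t in range(fft_size):
            out[i * hop + t] += frame[t] * window[t]
            norm[i * hop + t] += window[t] ** 2
    # Normalize by the summed squared window; guard against zero at the edges.
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


# Assumed mel-spectrogram settings: FFT size 1024, hop 256, window 1024.
# Assumed network: two temporal upsampling layers with factor 8 each.
fft_s, hop_s, win_s = istft_params(1024, 256, 1024, (8, 8))
n_bins = fft_s // 2 + 1  # 9 bins per frame instead of 1024 // 2 + 1 = 513

# Synthetic stand-in for the network output: 2 mel frames become 2 * 64 frames.
n_frames = 2 * 8 * 8
magnitude = [[1.0] * n_bins for _ in range(n_frames)]
phase = [[0.1 * k * i for k in range(n_bins)] for i in range(n_frames)]

wave = istft(magnitude, phase, fft_s, hop_s)
print((fft_s, hop_s, win_s), n_bins, len(wave))
```

Under these assumptions the network only has to estimate 9 magnitude and 9 phase values per frame, and the fixed iSTFT performs the remaining frequency-to-time conversion. Using fewer upsampling layers before the iSTFT would save more computation but leave a higher-dimensional spectrogram for the network to estimate.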