iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Paper title

iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Authors

Kaneko, Takuhiro; Tanaka, Kou; Kameoka, Hirokazu; Seki, Shogo

Abstract

In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). By contrast, the approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, reducing the computational cost from black-box modeling and avoiding redundant estimations of high-dimensional spectrograms. During our experiments, we applied our ideas to three HiFi-GAN variants and made the models faster and more lightweight with a reasonable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.
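The final step described in the abstract, turning a spectral representation into a waveform with the iSTFT instead of further learned layers, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the quantities handed to the iSTFT are per-frame magnitude and phase, and the frame size (16), hop size (4), Hann window, and the function names `hann`, `stft`, and `istft` are all hypothetical choices made for illustration. No neural network is involved: the magnitude and phase are computed from a known signal so that the transform can be checked by a round trip.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window (illustrative choice, not taken from the paper).
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Return (magnitudes, phases); each frame holds n_fft // 2 + 1 bins."""
    window = hann(n_fft)
    mags, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ]
        mags.append([abs(b) for b in bins])
        phases.append([cmath.phase(b) for b in bins])
    return mags, phases


def istft(mags, phases, n_fft, hop):
    """Overlap-add synthesis of a waveform from magnitude and phase frames."""
    window = hann(n_fft)
    length = n_fft + hop * (len(mags) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mag, phase) in enumerate(zip(mags, phases)):
        half = [m * cmath.exp(1j * p) for m, p in zip(mag, phase)]
        # Rebuild the full spectrum from the Hermitian symmetry of a real signal.
        full = half + [half[n_fft - k].conjugate()
                       for k in range(n_fft // 2 + 1, n_fft)]
        for n in range(n_fft):
            sample = sum(full[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                         for k in range(n_fft)).real / n_fft
            out[t * hop + n] += sample * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Divide by the summed squared window; samples with no coverage stay at 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Round trip on a toy signal: analysis followed by synthesis.
toy = [math.sin(2 * math.pi * 3 * n / 64) for n in range(64)]
toy_mags, toy_phases = stft(toy, n_fft=16, hop=4)
toy_recon = istft(toy_mags, toy_phases, n_fft=16, hop=4)
```

Each frame here carries only `n_fft // 2 + 1` values per quantity, which gives a sense of why a small frequency dimension keeps the predicted spectrogram cheap. The very first sample is not recovered because the periodic Hann window is zero there, so comparisons should be made on interior samples.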
