Reverberation Modeling for Source-Filter-based Neural Vocoder

Paper title

Reverberation Modeling for Source-Filter-based Neural Vocoder

Authors

Yang Ai, Xin Wang, Junichi Yamagishi, Zhen-Hua Ling

Abstract

This paper presents a reverberation module for source-filter-based neural vocoders that improves the performance of reverberant effect modeling. This module uses the output waveform of neural vocoders as an input and produces a reverberant waveform by convolving the input with a room impulse response (RIR). We propose two approaches to parameterizing and estimating the RIR. The first approach assumes a global time-invariant (GTI) RIR and directly learns the values of the RIR on a training dataset. The second approach assumes an utterance-level time-variant (UTV) RIR, which is invariant within one utterance but varies across utterances, and uses another neural network to predict the RIR values. We add the proposed reverberation module to the phase spectrum predictor (PSP) of a HiNet vocoder and jointly train the model. Experimental results demonstrate that the proposed module was helpful for modeling the reverberation effect and improving the perceived quality of generated reverberant speech. The UTV-RIR was shown to be more robust than the GTI-RIR to unknown reverberation conditions and achieved a perceptually better reverberation effect.
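The core operation described in the abstract, convolving the vocoder's output waveform with an RIR, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `apply_reverb` and `toy_rir`, the RIR length, and the exponentially decaying noise shape are all assumptions chosen for illustration. In the paper's terms, a GTI setup would share one learned RIR vector across all utterances, while a UTV setup would have a separate network predict one RIR per utterance; here the RIR is simply a fixed toy array.

```python
import numpy as np


def apply_reverb(dry, rir):
    """Convolve a dry waveform with an RIR and trim to the input length."""
    wet = np.convolve(dry, rir, mode="full")
    return wet[: len(dry)]


def toy_rir(length=800, decay=0.01, seed=0):
    """Hypothetical RIR: a unit direct path followed by a decaying noise tail."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    rir = 0.1 * rng.standard_normal(length) * np.exp(-decay * t)
    rir[0] = 1.0  # direct path
    return rir


# A single impulse stands in for the vocoder output in this toy example.
dry = np.zeros(1600)
dry[100] = 1.0

rir = toy_rir()
wet = apply_reverb(dry, rir)
```

Because the input is a single impulse, the output is just the RIR shifted to the impulse position, which makes the effect of the convolution easy to inspect. In a trainable model the same convolution would be written in a differentiable framework so that the RIR values (GTI) or the RIR-predicting network (UTV) can be optimized jointly with the vocoder.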
