Title
Sparse Mixture of Local Experts for Efficient Speech Enhancement
Authors
Abstract
In this paper, we investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks. By splitting up the speech denoising task into non-overlapping subproblems and introducing a classifier, we are able to improve denoising performance while also reducing computational complexity. More specifically, the proposed model incorporates a gating network that assigns noisy speech signals to an appropriate specialist network based on either speech degradation level or speaker gender. In our experiments, a baseline recurrent network is compared against an ensemble of similarly designed smaller recurrent networks regulated by the auxiliary gating network. Using stochastically generated batches from a large noisy speech corpus, the proposed model learns to estimate a time-frequency masking matrix based on the magnitude spectrogram of an input mixture signal. Both baseline and specialist networks are trained to estimate the ideal ratio mask, while the gating network is trained to perform subproblem classification. Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network, doing so with fewer model parameters.
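The inference pattern described above (a gating classifier that selects one specialist, which then estimates a time-frequency mask from a magnitude spectrogram) can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the linear stand-ins replace the trained recurrent networks, and the dimensions, expert count, and random "trained" weights are all hypothetical.

```python
import numpy as np

# Hypothetical dimensions; the paper's exact configuration is not given here.
F, T = 129, 50          # frequency bins, time frames
NUM_EXPERTS = 3         # e.g., three degradation-level subproblems

rng = np.random.default_rng(0)

# Stand-in "trained" parameters: one mask estimator per specialist network
# and a linear classifier acting as the gating network.
expert_weights = [rng.standard_normal((F, F)) * 0.05 for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((NUM_EXPERTS, F)) * 0.05

def specialist_mask(mag_spec, k):
    """Specialist k maps the magnitude spectrogram to a ratio-mask estimate
    in (0, 1) -- a stand-in for the recurrent IRM estimator."""
    return 1.0 / (1.0 + np.exp(-(expert_weights[k] @ mag_spec)))  # sigmoid

def gate(mag_spec):
    """Gating network: classify the mixture into one subproblem."""
    feat = mag_spec.mean(axis=1)          # time-pooled feature, shape (F,)
    logits = gate_weights @ feat
    return int(np.argmax(logits))         # hard (sparse) expert selection

def denoise(mag_spec):
    """Sparse mixture-of-experts inference: run only the selected specialist,
    then apply its mask to the mixture spectrogram."""
    k = gate(mag_spec)
    mask = specialist_mask(mag_spec, k)
    return mask * mag_spec, k

mixture = np.abs(rng.standard_normal((F, T)))  # toy magnitude spectrogram
enhanced, chosen = denoise(mixture)
```

Note the efficiency claim follows from the hard routing: only one specialist runs per input, so inference cost is that of a single small network plus the (cheap) gating classifier.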