Paper Title
Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement
Paper Authors
Paper Abstract
Speech enhancement (SE) aims to improve speech quality and intelligibility, both of which are related to smooth transitions between speech segments that may carry linguistic information, e.g., phones and syllables. In this study, we propose a novel phone-fortified perceptual loss (PFPL) that takes phonetic information into account when training SE models. To incorporate the phonetic information effectively, the PFPL is computed on latent representations of the wav2vec model, a powerful self-supervised encoder that captures rich phonetic information. To measure the distance between the distributions of the latent representations more accurately, the PFPL adopts the Wasserstein distance as its distance measure. Our experimental results first reveal that the PFPL correlates more strongly with perceptual evaluation metrics than signal-level losses do. Moreover, the results show that the PFPL enables a deep complex U-Net SE model to achieve highly competitive performance in standardized quality and intelligibility evaluations on the Voice Bank-DEMAND dataset.
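The idea of comparing latent representations with a Wasserstein distance can be illustrated with a small sketch. This is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for wav2vec (a fixed random projection of framed audio), and the distance is a simplification that averages exact 1-D Wasserstein-1 distances over feature dimensions. The frame length, feature dimension, and all function names are assumptions made for illustration only.

```python
import numpy as np


def wasserstein_1d(u, v):
    """Exact Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal-size samples with uniform weights, the optimal transport plan
    matches sorted values, so the distance is the mean absolute difference
    of the sorted samples.
    """
    u_sorted = np.sort(np.asarray(u, dtype=float))
    v_sorted = np.sort(np.asarray(v, dtype=float))
    if u_sorted.shape != v_sorted.shape:
        raise ValueError("samples must have the same length")
    return float(np.mean(np.abs(u_sorted - v_sorted)))


def toy_encoder(wave, frame_len=160, dim=8, seed=0):
    """Hypothetical stand-in for wav2vec (assumption, not the real model).

    Splits the waveform into non-overlapping frames and applies a fixed
    random linear projection followed by tanh, giving one latent vector
    per frame.
    """
    wave = np.asarray(wave, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed: same projection every call
    n_frames = len(wave) // frame_len
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    proj = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)
    return np.tanh(frames @ proj)


def perceptual_loss_sketch(enhanced, clean):
    """Average per-dimension 1-D Wasserstein distance between latent features.

    Each feature dimension is treated as a 1-D sample over frames; this is a
    simplification of a full multivariate Wasserstein distance.
    """
    z_enh = toy_encoder(enhanced)
    z_cln = toy_encoder(clean)
    per_dim = [
        wasserstein_1d(z_enh[:, d], z_cln[:, d]) for d in range(z_cln.shape[1])
    ]
    return float(np.mean(per_dim))
```

In an actual training setup the encoder would be a pretrained wav2vec model and the distance would need to be differentiable with respect to the enhanced signal; the sketch only shows how a distribution-level distance over latent features differs from a sample-by-sample signal-level loss.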