Paper Title

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Paper Authors

Karren Yang, Dejan Markovic, Steven Krenn, Vasu Agrawal, Alexander Richard

Paper Abstract

Since facial actions such as lip movements contain significant information about speech content, it is not surprising that audio-visual speech enhancement methods are more accurate than their audio-only counterparts. Yet, state-of-the-art approaches still struggle to generate clean, realistic speech without noise artifacts and unnatural distortions in challenging acoustic environments. In this paper, we propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals. Given the importance of speaker-specific cues in speech, we focus on developing personalized models that work well for individual speakers. We demonstrate the efficacy of our approach on a new audio-visual speech dataset collected in an unconstrained, large vocabulary setting, as well as existing audio-visual datasets, outperforming speech enhancement baselines on both quantitative metrics and human evaluation studies. Please see the supplemental video for qualitative results at https://github.com/facebookresearch/facestar/releases/download/paper_materials/video.mp4.
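To make the re-synthesis idea above more concrete, the sketch below shows one plausible way to predict discrete neural-codec codes from fused noisy-audio and lip-region video features, which a codec decoder would then turn into a clean waveform. This is a minimal illustration under assumed module names, feature dimensions, and codebook size; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' released code): an audio-visual encoder
# predicts discrete codes of a neural speech codec from noisy audio + video
# features; a codec/vocoder decoder (not shown) re-synthesizes clean speech.
# All modules, dimensions, and the codebook size are illustrative assumptions.
import torch
import torch.nn as nn


class AudioVisualCodePredictor(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, codebook_size=1024):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)    # e.g. mel-spectrogram frames
        self.video_proj = nn.Linear(video_dim, hidden)    # e.g. lip-region embeddings
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.to_codes = nn.Linear(hidden, codebook_size)  # logits over the codec codebook

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim),
        # assumed pre-aligned to the same frame rate.
        x = torch.cat([self.audio_proj(audio_feats), self.video_proj(video_feats)], dim=-1)
        h, _ = self.fusion(x)
        return self.to_codes(h)                           # (B, T, codebook_size)


if __name__ == "__main__":
    model = AudioVisualCodePredictor()
    audio = torch.randn(1, 100, 80)    # 100 frames of noisy-audio features
    video = torch.randn(1, 100, 512)   # 100 frames of visual features
    logits = model(audio, video)
    codes = logits.argmax(dim=-1)      # predicted codec codes; a codec decoder
    print(codes.shape)                 # would synthesize the clean waveform from these
```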
