Title

On the Role of Visual Cues in Audiovisual Speech Enhancement

Authors

Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz

Abstract

We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogy to phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
