Paper Title


Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Paper Authors

Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, Xilin Chen

Paper Abstract

Recent advances in deep learning have heightened interest among researchers in the field of visual speech recognition (VSR). Currently, most existing methods equate VSR with automatic lip reading, which attempts to recognise speech by analysing lip motion. However, human experience and psychological studies suggest that we do not always fix our gaze at each other's lips during a face-to-face conversation, but rather scan the whole face repetitively. This inspires us to revisit a fundamental yet somehow overlooked problem: can VSR models benefit from reading extraoral facial regions, i.e. beyond the lips? In this paper, we perform a comprehensive study to evaluate the effects of different facial regions with state-of-the-art VSR models, including the mouth, the whole face, the upper face, and even the cheeks. Experiments are conducted on both word-level and sentence-level benchmarks with different characteristics. We find that despite the complex variations of the data, incorporating information from extraoral facial regions, even the upper face, consistently benefits VSR performance. Furthermore, we introduce a simple yet effective method based on Cutout to learn more discriminative features for face-based VSR, hoping to maximise the utility of information encoded in different facial regions. Our experiments show obvious improvements over existing state-of-the-art methods that use only the lip region as inputs, a result we believe would probably provide the VSR community with some new and exciting insights.
