Paper Title

Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Paper Authors

R. Gnana Praveen, Eric Granger, Patrick Cardinal

Paper Abstract

Automatic emotion recognition (ER) has recently gained a lot of interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in the valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across the A-V modalities, allowing it to effectively leverage the inter-modal relationships while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. Deploying the joint A-V feature representation in the cross-attention module helps to simultaneously leverage both the intra- and inter-modal relationships, thereby significantly improving the performance of the system over a vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent.
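
Note: The following is a minimal PyTorch sketch of the joint cross-attention idea described in the abstract, where attention weights for each modality are computed from the correlation between the joint (concatenated) A-V representation and that modality's features. The tensor shapes, projection sizes, the tanh-scaled correlation, and the residual fusion step are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointCrossAttentionFusion(nn.Module):
    """Attends each modality with weights derived from the correlation between
    the joint (concatenated) audio-visual representation and that modality."""

    def __init__(self, d_audio: int, d_visual: int, d_attn: int = 128):
        super().__init__()
        d_joint = d_audio + d_visual
        # Projections of the joint representation (queries) and of each modality (keys).
        self.q_audio = nn.Linear(d_joint, d_attn, bias=False)
        self.q_visual = nn.Linear(d_joint, d_attn, bias=False)
        self.k_audio = nn.Linear(d_audio, d_attn, bias=False)
        self.k_visual = nn.Linear(d_visual, d_attn, bias=False)
        self.d_attn = d_attn

    def _attend(self, joint, x, q_proj, k_proj):
        # Correlation over time between the joint representation and one modality.
        scores = torch.matmul(q_proj(joint), k_proj(x).transpose(1, 2)) / math.sqrt(self.d_attn)
        attn = F.softmax(torch.tanh(scores), dim=-1)   # (B, T, T) cross-attention weights
        # Attended features plus a residual path that keeps the intra-modal information.
        return x + torch.matmul(attn, x)

    def forward(self, x_audio, x_visual):
        # x_audio: (B, T, d_audio), x_visual: (B, T, d_visual)
        joint = torch.cat([x_audio, x_visual], dim=-1)  # joint A-V representation (B, T, d_joint)
        att_a = self._attend(joint, x_audio, self.q_audio, self.k_audio)
        att_v = self._attend(joint, x_visual, self.q_visual, self.k_visual)
        # Concatenated attended features, to be fed to a valence/arousal regression head downstream.
        return torch.cat([att_a, att_v], dim=-1)        # (B, T, d_audio + d_visual)


if __name__ == "__main__":
    fusion = JointCrossAttentionFusion(d_audio=64, d_visual=128)
    a = torch.randn(2, 50, 64)    # 50 time steps of audio features (assumed shapes)
    v = torch.randn(2, 50, 128)   # matching visual features
    print(fusion(a, v).shape)     # torch.Size([2, 50, 192])
```

Because the queries come from the concatenated A-V features, the attention weights applied to each modality depend on both modalities, which is one way to realize the abstract's claim of leveraging inter-modal relationships while retaining intra-modal ones.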
