Paper Title
VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition
Paper Authors
Paper Abstract
This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTANet), to classify emotions reflected by input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has been developed that identifies the important visual, spoken, and textual features leading to the prediction of a particular emotion class. VISTANet fuses information from the image, speech, and text modalities using a hybrid of intermediate and late fusion, automatically adjusting the weights of their intermediate outputs while computing the weighted average. The KAAP technique computes the contribution of each modality, and of its corresponding features, toward predicting a particular emotion class. To mitigate the shortage of multimodal emotion datasets labelled with discrete emotion classes, we have constructed the IIT-R MMEmoRec dataset, consisting of images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). VISTANet achieves an overall emotion recognition accuracy of 80.11% on the IIT-R MMEmoRec dataset using the visual, spoken, and textual modalities, outperforming single- and dual-modality configurations. The code and data can be accessed at https://github.com/MIntelligence-Group/MMEmoRec.
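The additive fusion described in the abstract can be pictured with a minimal sketch: each modality branch produces class-level outputs, and the fused prediction is a weighted average whose per-modality weights are learned automatically. The module name, the softmax normalization of the weights, and the tensor shapes below are illustrative assumptions, not VISTANet's published architecture.

# Minimal sketch of additive (weighted-average) fusion over modality outputs.
# Assumptions (not from the paper): each branch yields class-level scores of
# shape (batch, num_classes); the per-modality weights are learnable scalars
# normalized with a softmax. VISTANet's exact formulation may differ.
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    def __init__(self, num_modalities: int = 3):
        super().__init__()
        # One learnable weight per modality (visual, spoken, textual).
        self.raw_weights = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, modality_outputs):
        # modality_outputs: list of tensors, each of shape (batch, num_classes).
        stacked = torch.stack(modality_outputs, dim=0)   # (M, batch, C)
        w = torch.softmax(self.raw_weights, dim=0)       # (M,), sums to 1
        fused = (w.view(-1, 1, 1) * stacked).sum(dim=0)  # weighted average
        return fused

# Example: fuse dummy outputs from three modality branches for four emotion classes.
fusion = AdditiveFusion(num_modalities=3)
outs = [torch.rand(2, 4) for _ in range(3)]
print(fusion(outs).shape)  # torch.Size([2, 4])

Because the weights are trained jointly with the rest of the network, a modality that is more reliable for a given dataset can receive a larger share of the weighted average; this is one plausible way to realize the "automatically adjusts the weights of their intermediate outputs" behaviour the abstract describes.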