Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Paper Title

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Authors

Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

Abstract

With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has recently attracted increasing attention. Despite significant progress, two major challenges remain on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to randomly missing modality features, which typically occur in realistic settings. In this paper, we propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context, which interacts with local unimodal features so that the two mutually promote each other. It not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, DLFR on the one hand performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it innovatively regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
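The abstract's efficiency claim can be illustrated with a rough count of attention score entries. The sketch below is not the authors' implementation. It assumes a simplified cost model in which local-local methods let each modality attend to every other modality's full sequence, while the global-local scheme exchanges information between one utterance-level token per modality and each local sequence in both directions. The function names and the sequence lengths are hypothetical.

```python
from itertools import permutations


def local_local_pairs(seq_lens):
    """Count attention score entries when each modality attends directly
    to the full sequence of every other modality (assumed cost model)."""
    return sum(t_query * t_key for t_query, t_key in permutations(seq_lens, 2))


def global_local_pairs(seq_lens):
    """Count attention score entries when a small global context
    (one utterance-level token per modality, an assumption of this sketch)
    exchanges information with each local sequence in both directions."""
    n_global = len(seq_lens)
    # global tokens attend to local features, and local features attend back
    return sum(2 * n_global * t for t in seq_lens)


if __name__ == "__main__":
    # Hypothetical unaligned sequence lengths for text, audio, and video.
    lens = [50, 500, 375]
    print("local-local :", local_local_pairs(lens))
    print("global-local:", global_local_pairs(lens))

    # Doubling every sequence length quadruples the local-local count
    # but only doubles the global-local count.
    doubled = [2 * t for t in lens]
    print("local-local (2x) :", local_local_pairs(doubled))
    print("global-local (2x):", global_local_pairs(doubled))
```

Under this counting model, the local-local count grows with the product of sequence lengths, whereas the global-local count grows linearly in their sum, which is the scaling behavior the abstract describes.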
