Affective Feedback Synthesis Towards Multimodal Text and Image Data

Paper Title

Affective Feedback Synthesis Towards Multimodal Text and Image Data

Authors

Puneet Kumar, Gaurav Bhat, Omkar Ingle, Daksh Goyal, Balasubramanian Raman

Abstract

In this paper, we define a novel task, affective feedback synthesis, which involves generating feedback for an input text and its corresponding image in a way similar to how humans respond to multimodal data. We propose a feedback synthesis system and train it using ground-truth human comments along with the image-text input. We also construct a large-scale dataset consisting of images, text, Twitter user comments, and the number of likes for the comments, collected by crawling news articles through Twitter feeds. The proposed system extracts textual features using a transformer-based textual encoder and visual features using a Faster region-based convolutional neural network (Faster R-CNN) model. The textual and visual features are concatenated to construct the multimodal features, from which the decoder synthesizes the feedback. We compare the results of the proposed system with baseline models using quantitative and qualitative measures. The generated feedback is analyzed using automatic and human evaluation and is found to be semantically similar to the ground-truth comments and relevant to the given text-image input.
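The fusion step described in the abstract can be illustrated with a toy sketch. The abstract states only that textual and visual features are concatenated; the mean pooling, the feature dimensions, and the names `mean_pool` and `fuse` below are assumptions made for illustration, not the authors' implementation. The hand-written vectors stand in for the outputs of a transformer text encoder and a Faster R-CNN region extractor.

```python
from typing import List

Vector = List[float]


def mean_pool(features: List[Vector]) -> Vector:
    """Average a list of equal-length feature vectors into one vector."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def fuse(text_features: List[Vector], region_features: List[Vector]) -> Vector:
    """Concatenate pooled text and pooled visual features.

    Assumption: pooling before concatenation is an illustrative choice;
    the abstract does not say along which axis the features are joined.
    """
    return mean_pool(text_features) + mean_pool(region_features)


# Toy stand-ins: 2 token vectors (dim 3) and 2 image-region vectors (dim 2).
text_features = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
region_features = [[0.0, 1.0], [1.0, 0.0]]

multimodal = fuse(text_features, region_features)
print(multimodal)  # [2.0, 3.0, 4.0, 0.5, 0.5]
```

In a real system the fused vector (or sequence) would be passed to the decoder that generates the feedback text; that part is omitted here because the abstract gives no details about it.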
