Paper Title


MemeTector: Enforcing deep focus for meme detection

Paper Authors

Christos Koutlis, Manos Schinas, Symeon Papadopoulos

Paper Abstract


Image memes, and specifically their widely known variation, image macros, are a special new media type that combines text with images and is used in social media to playfully or subtly express humour, irony, sarcasm, and even hate. It is important to accurately retrieve image memes from social media to better capture the cultural and social aspects of online phenomena and to detect potential issues (hate speech, disinformation). Essentially, the background image of an image macro is a regular image, easily recognized as such by humans but cumbersome for a machine to classify due to its feature-map similarity with the complete image macro. Hence, accumulating suitable feature maps in such cases can lead to a deep understanding of the notion of image memes. To this end, we propose a methodology, called Visual Part Utilization, that uses the visual part of image memes as instances of the regular image class and the initial image memes as instances of the image meme class, forcing the model to concentrate on the critical parts that characterize an image meme. Additionally, we employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts and to make its predictions interpretable. Several training and test scenarios involving web-scraped regular images with controlled text presence are considered to evaluate the model in terms of robustness and accuracy. The findings indicate that light visual part utilization combined with sufficient text presence during training yields the best and most robust model, surpassing the state of the art. Source code and dataset are available at https://github.com/mever-team/memetector.
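To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that): Visual Part Utilization is rendered as a dataset-construction step that pairs each image macro with its text-free visual part under opposite labels, and the trainable attention mechanism is rendered as a learned attention-pooling layer over ViT patch tokens whose weights double as an interpretability map. The function names, the embedding size (768), the 14x14 patch grid, and the class indices are illustrative assumptions.

```python
import torch
import torch.nn as nn

# --- Visual Part Utilization (VPU), sketched as dataset construction ---
# Each meme yields two training instances: the full image macro labelled
# as "meme" (1) and its text-free visual part labelled as "regular" (0).
def build_vpu_dataset(meme_pairs):
    samples = []
    for macro_img, visual_part_img in meme_pairs:
        samples.append((macro_img, 1))        # image meme class
        samples.append((visual_part_img, 0))  # regular image class
    return samples

# --- Trainable attention pooling over ViT patch tokens ---
class AttentionPooledClassifier(nn.Module):
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.attn_score = nn.Linear(dim, 1)   # one scalar score per token
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_patches, dim) embeddings from a ViT encoder
        weights = torch.softmax(self.attn_score(tokens), dim=1)  # (B, N, 1)
        pooled = (weights * tokens).sum(dim=1)                   # (B, dim)
        logits = self.head(pooled)
        # Reshaped to the patch grid, `weights` highlights the regions
        # (typically the overlaid text) that drive the meme prediction.
        return logits, weights.squeeze(-1)

# Smoke test with random embeddings standing in for ViT output:
tokens = torch.randn(4, 196, 768)  # 14x14 patches of a 224x224 input
logits, attn = AttentionPooledClassifier()(tokens)
print(logits.shape, attn.shape)    # torch.Size([4, 2]) torch.Size([4, 196])
```

Pooling with learned per-patch weights rather than a CLS token lets the same forward pass return both the class logits and a patch-level saliency map, which matches the interpretability goal stated in the abstract.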
