Paper Title
Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media
Paper Authors
Paper Abstract
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content, and it plays an important role in various applications such as intention understanding and user recommendation. As social media posts tend to be multimodal, Multimodal Named Entity Recognition (MNER) for text with its accompanying image is attracting more and more attention, since some textual components can only be understood in combination with visual information. However, existing approaches have two drawbacks: 1) The meanings of the text and its accompanying image do not always match, so the textual information still plays the major role. Yet social media posts are usually shorter and more informal than other regular content, which easily causes incomplete semantic descriptions and the data sparsity problem. 2) Although visual representations of whole images or objects are already used, existing methods ignore either the fine-grained semantic correspondence between objects in images and words in text, or the objective fact that some images contain misleading objects or no objects at all. In this work, we address these two problems by introducing multi-granularity cross-modality representation learning. To resolve the first problem, we enhance the representation of each word in the text through semantic augmentation. As for the second issue, we perform cross-modality semantic interaction between text and vision at different visual granularities to obtain the most effective multimodal guidance representation for every word. Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two tweet benchmark datasets. The code, data, and the best-performing models are available at https://github.com/LiuPeiP-CS/IIE4MNER
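To make the multi-granularity cross-modal interaction described above more concrete, below is a minimal sketch of how per-word fusion over two visual granularities (a global image feature and object-region features) with a learned gate might look. This is not the authors' released implementation; the module name, dimensions, and the gating mechanism are illustrative assumptions, and the gate is one common way to down-weight visual guidance when the image contains misleading or no relevant objects.

```python
# Illustrative sketch only, not the IIE4MNER code: word-level cross-modal
# attention over a global image vector and object-region vectors, followed
# by a per-word gate that controls how much visual guidance is kept.
import torch
import torch.nn as nn


class MultiGranularityFusion(nn.Module):
    def __init__(self, d_text=768, d_vis=2048, d_model=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.vis_proj = nn.Linear(d_vis, d_model)
        # one attention block per visual granularity (whole image vs. objects)
        self.img_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.obj_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # gate in [0, 1] deciding how much visual context each word absorbs
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, words, image, objects):
        # words:   (B, T, d_text)  contextual word representations
        # image:   (B, 1, d_vis)   global image feature
        # objects: (B, K, d_vis)   detected object-region features
        w = self.text_proj(words)
        img = self.vis_proj(image)
        obj = self.vis_proj(objects)
        # each word queries the two visual granularities separately
        img_ctx, _ = self.img_attn(query=w, key=img, value=img)
        obj_ctx, _ = self.obj_attn(query=w, key=obj, value=obj)
        vis_ctx = img_ctx + obj_ctx
        g = self.gate(torch.cat([w, vis_ctx], dim=-1))
        # gated residual fusion: irrelevant visual context can be suppressed per word
        return w + g * vis_ctx


if __name__ == "__main__":
    fusion = MultiGranularityFusion()
    words = torch.randn(2, 16, 768)    # e.g. contextual (BERT-like) word features
    image = torch.randn(2, 1, 2048)    # e.g. a CNN global image feature
    objects = torch.randn(2, 5, 2048)  # e.g. detector object-region features
    print(fusion(words, image, objects).shape)  # torch.Size([2, 16, 256])
```

The fused per-word representations would then feed a standard NER tagging layer; for the exact architecture and the semantic-augmentation component, refer to the paper and the linked repository.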