Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Paper title

Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Authors

Dong, Xiao; Zhan, Xunlin; Wei, Yunchao; Wei, Xiaoyong; Wang, Yaowei; Lu, Minlong; Cao, Xiaochun; Liang, Xiaodan

Abstract

Our goal in this research is to study a more realistic setting in which we can conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories. We first contribute the Product1M dataset and define two practical instance-level retrieval tasks that enable evaluation on price comparison and personalized recommendation. For both instance-level tasks, it is quite challenging to accurately pinpoint the product target mentioned in the visual-linguistic data and to effectively reduce the influence of irrelevant content. To address this, we train a more effective cross-modal pretraining model that can adaptively incorporate key concept information from the multi-modal data by using an entity graph whose nodes and edges denote entities and the similarity relations between entities, respectively. Specifically, we propose a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model for instance-level commodity retrieval. It explicitly injects entity knowledge into the multi-modal networks in both node-based and subgraph-based ways via a self-supervised hybrid-stream transformer, which can reduce the confusion between different object contents and thereby effectively guide the network to focus on entities with real semantics. Experimental results verify the efficacy and generalizability of EGE-CMP, which outperforms several SOTA cross-modal baselines such as CLIP, UNITER and CAPTURE.
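The entity graph described in the abstract (nodes are entities, edges are similarity relations between entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the 0.8 threshold, the function names and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.8):
    """Build an undirected entity graph as an adjacency dict.

    embeddings: dict mapping entity name -> embedding vector (hypothetical input).
    threshold:  assumed cutoff; pairs at or above it are connected by an edge.
    Returns a dict mapping entity -> {neighbor: similarity}.
    """
    names = list(embeddings)
    graph = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                # Store the edge in both directions, since similarity is symmetric.
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


# Toy embeddings, invented for illustration only.
toy = {
    "lipstick": [1.0, 0.0, 0.0],
    "lip gloss": [0.9, 0.1, 0.0],
    "shampoo": [0.0, 1.0, 0.0],
}
graph = build_entity_graph(toy)
```

In this toy example "lipstick" and "lip gloss" are linked because their vectors are nearly parallel, while "shampoo" stays isolated. A node-based use of such a graph would look at one entity at a time, whereas a subgraph-based use would look at an entity together with its neighbors.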
