Title
Information retrieval for label noise document ranking by bag sampling and group-wise loss
Authors
Abstract
Long document retrieval (DR) has always been a tremendous challenge for reading comprehension and information retrieval. In recent years, pre-trained models have achieved good results in both the retrieval and ranking stages for long documents. However, crucial problems remain in long document ranking, such as noisy data labels, long document representation, and unbalanced negative sampling. To eliminate the noise in labeled data and to sample negatives from long documents reasonably, we propose a bag sampling method and a group-wise Localized Contrastive Estimation (LCE) loss. We encode each long document using its head, middle, and tail passages, and in the retrieval stage we use dense retrieval to generate candidate data. At the ranking stage, the retrieved data is divided into multiple bags, and negative samples are selected from each bag. After sampling, two losses are combined. The first loss is LCE. To fit bag sampling well, after the query and document are encoded, the global features of each group are extracted by a convolutional layer and max-pooling to improve the model's resistance to labeling noise; finally, the group-wise LCE loss is computed. Notably, our model shows excellent performance on the MS MARCO long document ranking leaderboard.
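The two central ideas in the abstract, splitting ranked candidates into bags before drawing negatives and scoring a positive against those negatives with a contrastive (LCE-style) loss, can be sketched as follows. This is a minimal illustration only, not the authors' implementation; the function names, bag/negative counts, and the pure-Python scoring are assumptions for clarity.

```python
import math
import random

def bag_sample(candidates, num_bags, negs_per_bag, seed=0):
    """Split a ranked candidate list into contiguous bags and draw
    negatives from each bag, so negatives cover all relevance levels."""
    rng = random.Random(seed)
    bag_size = math.ceil(len(candidates) / num_bags)
    bags = [candidates[i:i + bag_size]
            for i in range(0, len(candidates), bag_size)]
    return [rng.sample(bag, min(negs_per_bag, len(bag))) for bag in bags]

def lce_loss(pos_score, neg_scores):
    """Contrastive loss in the LCE style: cross-entropy of a softmax over
    the positive score and the sampled negative scores, with the positive
    as the target (computed with a max-shift for numerical stability)."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score  # -log softmax(pos)
```

For example, with 10 retrieved candidates, `bag_sample(list(range(10)), num_bags=2, negs_per_bag=2)` yields two bags of two negatives each, one bag from the top half of the ranking and one from the bottom half; the loss shrinks toward zero as the positive's score rises above the negatives'.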