Paper Title

SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval

Authors

Bai, Yang; Li, Xiaoguang; Wang, Gang; Zhang, Chaoliang; Shang, Lifeng; Xu, Jun; Wang, Zhaowei; Wang, Fangshan; Liu, Qun

Abstract

Term-based sparse representations dominate first-stage text retrieval in industrial applications, due to their advantages in efficiency, interpretability, and exact term matching. In this paper, we study the problem of transferring the deep knowledge of a pre-trained language model (PLM) to term-based sparse representations, aiming to improve the representation capacity of bag-of-words (BoW) methods for semantic-level matching while keeping their advantages. Specifically, we propose a novel framework, SparTerm, to directly learn sparse text representations in the full vocabulary space. SparTerm comprises an importance predictor, which predicts the importance of each term in the vocabulary, and a gating controller, which controls term activation. These two modules cooperatively ensure the sparsity and flexibility of the final text representation, which unifies term weighting and expansion in the same framework. Evaluated on the MSMARCO dataset, SparTerm significantly outperforms traditional sparse methods and achieves state-of-the-art ranking performance among all PLM-based sparse models.
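To make the interplay of the two modules concrete, here is a minimal sketch of how a per-term importance vector and a per-term gate could be combined into a sparse representation. It is an illustration under stated assumptions, not the authors' implementation: the element-wise combination, the sigmoid with a 0.5 threshold, the dot-product scoring, and all names (`sparse_representation`, `score`, the toy vocabulary and numbers) are hypothetical. In the real model both vectors would come from a PLM rather than being hard-coded.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sparse_representation(vocab, importance, gate_logits, threshold=0.5):
    """Keep the importance weight only for terms whose gate is open.

    Assumption: the gate is binarized by thresholding a sigmoid, and the
    final weight is importance * gate (hypothetical combination rule).
    """
    rep = {}
    for term, weight, logit in zip(vocab, importance, gate_logits):
        gate_open = sigmoid(logit) > threshold
        if gate_open and weight > 0.0:
            rep[term] = weight
    return rep


def score(query_rep, passage_rep):
    """Dot product over the terms shared by both sparse representations."""
    return sum(w * passage_rep[t] for t, w in query_rep.items() if t in passage_rep)


# Toy vocabulary and hand-picked numbers, for illustration only.
vocab = ["cat", "feline", "dog", "the", "pet"]
importance = [0.9, 0.4, 0.1, 0.3, 0.5]     # stand-in for the importance predictor
gate_logits = [4.0, 1.5, -3.0, -2.0, 0.8]  # stand-in for the gating controller

passage_rep = sparse_representation(vocab, importance, gate_logits)
query_rep = {"feline": 1.0, "pet": 1.0}
match_score = score(query_rep, passage_rep)
```

In this toy setup the gate closes "dog" and "the", so only three of the five vocabulary entries carry weight. A term such as "feline" can be activated even if it never appears in the passage text, which is how weighting and expansion can share one representation.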
