Learning Term Discrimination

Paper Title

Learning Term Discrimination

Authors

Jibril Frej, Phillipe Mulhem, Didier Schwab, Jean-Pierre Chevallet

Abstract

Document indexing is a key component of efficient information retrieval (IR). After preprocessing steps such as stemming and stop-word removal, document indexes usually store term frequencies (tf). Along with tf (which only reflects the importance of a term within a document), traditional IR models use term discrimination values (TDVs), such as inverse document frequency (idf), to favor discriminative terms during retrieval. In this work, we propose to learn TDVs for document indexing with shallow neural networks that approximate traditional IR ranking functions such as TF-IDF and BM25. Our proposal outperforms traditional approaches in terms of both nDCG and recall, even with few positively labelled query-document pairs as training data. When used to filter out vocabulary terms that have zero discrimination value, our learned TDVs both significantly lower the memory footprint of the inverted index and speed up the retrieval process (BM25 is up to 3 times faster), without degrading retrieval quality.
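
The idea of pruning zero-discrimination terms from an inverted index can be illustrated with a minimal sketch. This is not the authors' method: plain idf, log(N / df), stands in for the learned TDVs, the toy corpus is invented, and the helper names (`build_index`, `idf_tdv`, `prune`, `score`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return dict(index)


def idf_tdv(index, n_docs):
    """Stand-in TDV: plain idf = log(N / df). Assumption: the paper learns
    these values with a neural network instead of computing them this way."""
    return {term: math.log(n_docs / len(postings))
            for term, postings in index.items()}


def prune(index, tdv):
    """Drop the posting lists of terms whose discrimination value is zero."""
    return {term: postings for term, postings in index.items()
            if tdv[term] > 0.0}


def score(query, index, tdv):
    """TF-IDF-style score: sum of tf * TDV over the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * tdv[term]
    return dict(scores)


# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
index = build_index(docs)
tdv = idf_tdv(index, len(docs))
pruned = prune(index, tdv)

full_scores = score("the cat", index, tdv)
pruned_scores = score("the cat", pruned, tdv)
```

In this toy setting, a term that occurs in every document gets a TDV of zero, so its posting list contributes nothing to any score. Removing it shrinks the index and shortens query processing while leaving the scores of matching documents unchanged.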
