Paper Title

Efficient Document Retrieval by End-to-End Refining and Quantizing BERT Embedding with Contrastive Product Quantization

Paper Authors

Zexuan Qiu, Qinliang Su, Jianxing Yu, Shijing Si

Paper Abstract

Efficient document retrieval relies heavily on semantic hashing, which learns a binary code for every document and uses the Hamming distance to measure document distances. However, existing semantic hashing methods are mostly built on outdated TF-IDF features, which miss much of the important semantic information in documents. Furthermore, the Hamming distance can only take one of a few integer values, significantly limiting its ability to represent document distances. To address these issues, in this paper we propose to leverage BERT embeddings to perform efficient retrieval based on the product quantization technique, which assigns to every document a real-valued codeword from a codebook, instead of a binary code as in semantic hashing. Specifically, we first transform the original BERT embeddings via a learnable mapping and feed the transformed embeddings into a probabilistic product quantization module that outputs the assigned codeword. The refining and quantizing modules can be optimized in an end-to-end manner by minimizing a probabilistic contrastive loss. A method based on mutual information maximization is further proposed to improve the representativeness of codewords, so that documents can be quantized more accurately. Extensive experiments on three benchmarks demonstrate that our proposed method significantly outperforms current state-of-the-art baselines.
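To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the refine-then-quantize idea: a learnable mapping over BERT embeddings followed by a probabilistic product quantization module trained with a contrastive objective. This is an illustrative reconstruction from the abstract alone, not the authors' code; the class and function names, the Gumbel-softmax relaxation used for the probabilistic codeword assignment, the InfoNCE form of the contrastive loss, and all dimensions are assumptions, and the paper's actual formulation (including its mutual information maximization term) may differ.

```python
# Illustrative sketch of probabilistic product quantization over BERT
# embeddings, reconstructed from the abstract. All names, dimensions, and
# the Gumbel-softmax relaxation are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticPQ(nn.Module):
    def __init__(self, bert_dim=768, num_subspaces=8, codewords_per_space=256):
        super().__init__()
        self.M = num_subspaces
        self.K = codewords_per_space
        self.sub_dim = bert_dim // num_subspaces
        # Learnable mapping that "refines" the raw BERT embedding
        # (assumed linear here for simplicity).
        self.refine = nn.Linear(bert_dim, bert_dim)
        # One codebook per subspace: shape (M, K, sub_dim).
        self.codebooks = nn.Parameter(torch.randn(self.M, self.K, self.sub_dim))

    def forward(self, bert_emb, tau=1.0):
        z = self.refine(bert_emb)                 # (B, D)
        z = z.view(-1, self.M, self.sub_dim)      # (B, M, sub_dim)
        # Negative squared distance to each codeword gives assignment logits.
        diff = z.unsqueeze(2) - self.codebooks.unsqueeze(0)  # (B, M, K, sub_dim)
        logits = -(diff ** 2).sum(-1)             # (B, M, K)
        # Differentiable "probabilistic" codeword assignment via
        # Gumbel-softmax (an assumed relaxation).
        probs = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
        # Quantized embedding: soft combination of codewords per subspace.
        quantized = torch.einsum('bmk,mkd->bmd', probs, self.codebooks)
        return quantized.reshape(bert_emb.size(0), -1), logits

def contrastive_loss(q1, q2, temperature=0.1):
    # InfoNCE over two quantized views of the same batch of documents
    # (an assumed instantiation of the paper's probabilistic contrastive loss).
    q1, q2 = F.normalize(q1, dim=-1), F.normalize(q2, dim=-1)
    logits = q1 @ q2.t() / temperature
    labels = torch.arange(q1.size(0), device=q1.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real BERT encoder output.
pq = ProbabilisticPQ()
view1, view2 = torch.randn(4, 768), torch.randn(4, 768)
q1, _ = pq(view1)
q2, _ = pq(view2)
loss = contrastive_loss(q1, q2)
loss.backward()  # gradients flow end-to-end through the soft assignment
```

Because the codeword assignment is a soft, differentiable distribution rather than a hard argmax, gradients reach both the refining map and the codebooks, which is what allows the end-to-end training the abstract describes.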
