Paper Title

Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking

Authors

Minghan Li, Xinyu Zhang, Ji Xin, Hongyang Zhang, Jimmy Lin

Abstract

In information retrieval (IR), candidate set pruning is commonly used to speed up two-stage relevance ranking. However, this approach lacks accurate error control and typically trades accuracy for computational efficiency in an empirical fashion, without theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking: the test error after pruning is guaranteed, with high probability, to stay under a user-specified threshold. Both in-domain and out-of-domain experiments show that our method successfully prunes the first-stage retrieved candidate sets to improve second-stage reranking speed while satisfying the pre-specified accuracy constraints in both settings. For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000, which increases the reranking speed by about 37 times, while keeping MRR@10 above a pre-specified value of 0.38 with about 90% empirical coverage; the empirical baselines fail to provide such a guarantee. Code and data are available at: https://github.com/alexlimh/CEC-Ranking.
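
To make the idea of a "certified" pruning cutoff concrete, here is a minimal Python sketch. It is not the authors' algorithm (the linked repository holds that); it assumes a held-out calibration set of queries and swaps in a plain Hoeffding bound with a union bound over the candidate cutoffs. All function names and the data layout are hypothetical and chosen only for illustration.

```python
import math


def reciprocal_rank_at_10(rel_rank, better_ranks, k):
    """RR@10 of one query after keeping only the top-k first-stage candidates.

    rel_rank:     first-stage rank (1-based) of the relevant passage.
    better_ranks: first-stage ranks of passages that the reranker would
                  place above the relevant one.
    """
    if rel_rank > k:
        return 0.0  # the relevant passage was pruned away
    # Position after reranking = 1 + surviving passages that outrank it.
    position = 1 + sum(1 for r in better_ranks if r <= k)
    return 1.0 / position if position <= 10 else 0.0


def choose_cutoff(calibration, cutoffs, target, delta):
    """Smallest cutoff whose lower confidence bound on MRR@10 meets target.

    Assumption: calibration queries are i.i.d. draws from the test
    distribution. Hoeffding's inequality plus a union bound over the
    cutoffs then makes every lower bound hold simultaneously with
    probability at least 1 - delta.
    """
    n = len(calibration)
    margin = math.sqrt(math.log(len(cutoffs) / delta) / (2 * n))
    for k in sorted(cutoffs):
        mean_rr = sum(
            reciprocal_rank_at_10(rel, better, k) for rel, better in calibration
        ) / n
        if mean_rr - margin >= target:
            return k
    return None  # no cutoff can be certified; fall back to no pruning


if __name__ == "__main__":
    # Toy calibration set: the relevant passage always sits at first-stage rank 5.
    toy = [(5, [])] * 200
    print(choose_cutoff(toy, [1, 5, 10, 1000], target=0.8, delta=0.1))
```

In this sketch a smaller certified cutoff means fewer passages reach the second-stage reranker, which is where a speedup would come from. A Hoeffding bound is conservative, so a tighter concentration bound would generally certify smaller cutoffs.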
