Paper Title
C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References
Paper Authors
Paper Abstract
We consider the problem of pretraining a two-stage open-domain question answering (QA) system (retriever + reader) with strong transfer capabilities. The key challenge is how to construct a large number of high-quality question-answer-context triplets without task-specific annotations. Specifically, the triplets should align well with downstream tasks by: (i) covering a wide range of domains (for open-domain applications), (ii) linking a question to its semantically relevant context with supporting evidence (for training the retriever), and (iii) identifying the correct answer in the context (for training the reader). Previous pretraining approaches generally fall short of one or more of these requirements. In this work, we automatically construct a large-scale corpus that meets all three criteria by consulting millions of references cited within Wikipedia. The well-aligned pretraining signals benefit both the retriever and the reader significantly. Our pretrained retriever leads to 2%-10% absolute gains in top-20 accuracy, and with our pretrained reader, the entire system improves by up to 4% in exact match.
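To make the triplet construction concrete, below is a minimal, hypothetical Python sketch of how a citation-backed Wikipedia statement might be turned into a question-answer-context triplet that satisfies the three criteria from the abstract. The data structure, function names, and the cloze-style question formation are illustrative assumptions, not the paper's exact pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QACTriplet:
    """A question-answer-context triplet mined without task-specific annotation."""
    question: str  # derived from a Wikipedia statement that carries a citation
    answer: str    # span that must appear verbatim in the context (criterion iii)
    context: str   # passage from the externally cited reference (criterion ii)


def build_triplet(statement: str, answer_span: str,
                  cited_passage: str) -> Optional[QACTriplet]:
    """Turn a cited Wikipedia statement into a pretraining triplet.

    Hypothetical construction loosely following the abstract: the cited
    reference supplies semantically relevant, evidence-bearing context,
    and the answer span must be recoverable from that context.
    """
    # Criterion (iii): the reader needs the correct answer inside the context.
    if answer_span not in cited_passage:
        return None
    # Criterion (ii): mask the answer in the statement to form a cloze-style
    # question, linked to the supporting evidence in the cited passage.
    question = statement.replace(answer_span, "[MASK]", 1)
    return QACTriplet(question=question, answer=answer_span, context=cited_passage)


# Toy usage example:
triplet = build_triplet(
    statement="The Eiffel Tower was completed in 1889.",
    answer_span="1889",
    cited_passage="Construction of the Eiffel Tower finished in 1889 "
                  "for the World's Fair.",
)
print(triplet)
```

Because the statements come from Wikipedia articles across many topics and the contexts come from the millions of cited external references, triplets built this way would naturally cover a wide range of domains (criterion i) while providing aligned training signals for both the retriever and the reader.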