Paper Title
CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization
Paper Authors
Paper Abstract
The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT encoder that is linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module that, together with a multi-paragraph abstractive summarizer, adjusts retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.
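
The retriever's two combination steps named in the abstract (a linear composition of dense and TF-IDF scores, followed by reciprocal-rank fusion with a BM25 ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight `mu`, the constant `k = 60` (a value commonly used for reciprocal-rank fusion), and the toy scores are all assumptions chosen for illustration.

```python
from typing import Dict, List


def linear_combine(dense: Dict[str, float],
                   sparse: Dict[str, float],
                   mu: float = 0.5) -> Dict[str, float]:
    """Weighted sum of two per-document score maps (hypothetical weight mu).

    A document missing from one map contributes 0 from that map.
    """
    docs = set(dense) | set(sparse)
    return {d: mu * dense.get(d, 0.0) + (1.0 - mu) * sparse.get(d, 0.0)
            for d in docs}


def to_ranking(scores: Dict[str, float]) -> List[str]:
    """Sort document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


if __name__ == "__main__":
    # Toy scores standing in for Siamese-BERT and TF-IDF similarities.
    dense_scores = {"a": 0.9, "b": 0.2, "c": 0.4}
    tfidf_scores = {"a": 0.3, "b": 0.6, "c": 0.1}
    combined_ranking = to_ranking(linear_combine(dense_scores, tfidf_scores))

    # Toy ranking standing in for a BM25 result list.
    bm25_ranking = ["b", "d", "a"]

    print(reciprocal_rank_fusion([combined_ranking, bm25_ranking]))
```

Reciprocal-rank fusion uses only rank positions, so the BM25 scores do not need to be on the same scale as the dense and TF-IDF scores. The linear step, by contrast, assumes the two score maps are comparable.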