Paper Title
LinkBERT: Pretraining Language Models with Document Links
Paper Authors
Paper Abstract
Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.
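To make the described pretraining setup concrete, the following is a minimal Python sketch of how one might build LM input pairs from a document graph and assign Document Relation Prediction labels. The toy corpus, function names, and sampling details are illustrative assumptions, not the authors' released implementation.

```python
import random

# Hypothetical toy corpus (an assumption for illustration): each document has
# text segments and a list of outgoing links (e.g., hyperlinks or citations).
corpus = {
    "doc_a": {"segments": ["segment a1 ...", "segment a2 ..."], "links": ["doc_b"]},
    "doc_b": {"segments": ["segment b1 ...", "segment b2 ..."], "links": []},
    "doc_c": {"segments": ["segment c1 ...", "segment c2 ..."], "links": []},
}

# 3-way Document Relation Prediction labels described in the abstract.
RELATIONS = ["contiguous", "random", "linked"]

def make_training_pair(doc_id):
    """Build one (segment_A, segment_B, relation_label) example.

    Segment B is drawn from (i) the same document (contiguous), (ii) a random
    document, or (iii) a document linked from A, mirroring how LinkBERT places
    linked documents in the same LM context.
    """
    doc = corpus[doc_id]
    seg_a = doc["segments"][0]
    relation = random.choice(RELATIONS)
    if relation == "contiguous":
        seg_b = doc["segments"][1]
    elif relation == "linked" and doc["links"]:
        linked_doc = corpus[random.choice(doc["links"])]
        seg_b = random.choice(linked_doc["segments"])
    else:
        relation = "random"
        other = random.choice([d for d in corpus if d != doc_id])
        seg_b = random.choice(corpus[other]["segments"])
    # The LM input would be "[CLS] seg_a [SEP] seg_b [SEP]"; pretraining then
    # combines masked language modeling with a classification head on [CLS]
    # that predicts `relation` (the Document Relation Prediction objective).
    return seg_a, seg_b, RELATIONS.index(relation)

print(make_training_pair("doc_a"))
```

This sketch only covers how linked documents end up in the same context and how the relation label is formed; masking and the actual model forward pass are left out for brevity.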