Paper Title
MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction
Authors
Abstract
Relation extraction in the biomedical domain is challenging because labeled data is scarce and annotation is costly, requiring domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw text. Such a pipeline is prone to noise and faces added challenges when scaled to cover a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships, ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents MedDistant19, a more accurate benchmark for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Since thorough evaluation with domain-specific language models has been lacking, we also conduct experiments that test whether general-domain relation extraction findings carry over to biomedical relation extraction.
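The two ideas the abstract relies on, distant labeling against a knowledge graph and overlap between training and test relationships, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `distant_label` and `triple_overlap`, the toy entities, and the choice to measure overlap as the fraction of distinct test triples that also occur in training. The abstract does not specify how the 26% to 86% figures were computed, so this is not the authors' procedure.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]   # (head, relation, tail)
Pair = Tuple[str, str]          # (head, tail)


def distant_label(pairs: Iterable[Pair], kg: Dict[Pair, str]) -> List[Triple]:
    """Label each co-occurring entity pair with the relation the knowledge
    graph holds for it, or "NA" when the graph has no such pair.
    (Hypothetical helper; real pipelines also need entity linking.)"""
    return [(head, kg.get((head, tail), "NA"), tail) for head, tail in pairs]


def triple_overlap(train: Iterable[Triple], test: Iterable[Triple]) -> float:
    """Fraction of distinct test triples that also appear in training.
    (Assumed definition of train-test overlap, for illustration only.)"""
    train_set: Set[Triple] = set(train)
    test_set: Set[Triple] = set(test)
    if not test_set:
        return 0.0
    return len(test_set & train_set) / len(test_set)


# Toy knowledge graph and entity pairs (made-up data, not from SNOMED CT).
toy_kg = {
    ("aspirin", "headache"): "treats",
    ("insulin", "diabetes"): "treats",
    ("metformin", "diabetes"): "treats",
}
train_triples = distant_label(
    [("aspirin", "headache"), ("insulin", "diabetes")], toy_kg
)
test_triples = distant_label(
    [("aspirin", "headache"), ("metformin", "diabetes")], toy_kg
)

print(triple_overlap(train_triples, test_triples))
```

A benchmark without train-test leakage in this sense would split the knowledge graph triples first, so that no test triple can be recovered by memorizing training triples, and only then align each split with text.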