Biomedical named entity recognition using BERT in the machine reading comprehension framework

Paper title

Biomedical named entity recognition using BERT in the machine reading comprehension framework

Authors

Cong Sun, Zhihao Yang, Lei Wang, Yin Zhang, Hongfei Lin, Jian Wang

Abstract

Recognition of biomedical entities from literature is a challenging research focus, which is the foundation for extracting a large amount of biomedical knowledge existing in unstructured texts into structured formats. Using the sequence labeling framework to implement biomedical named entity recognition (BioNER) is currently a conventional method. This method, however, often cannot take full advantage of the semantic information in the dataset, and the performance is not always satisfactory. In this work, instead of treating the BioNER task as a sequence labeling problem, we formulate it as a machine reading comprehension (MRC) problem. This formulation can introduce more prior knowledge through well-designed queries, and it no longer needs decoding processes such as conditional random fields (CRF). We conduct experiments on six BioNER datasets, and the experimental results demonstrate the effectiveness of our method. Our method achieves state-of-the-art (SOTA) performance on the BC4CHEMD, BC5CDR-Chem, BC5CDR-Disease, NCBI-Disease, BC2GM and JNLPBA datasets, achieving F1-scores of 92.92%, 94.19%, 87.83%, 90.04%, 85.48% and 78.93%, respectively.
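
To make the reformulation concrete, the sketch below shows one plausible way to turn a BIO-tagged sentence into MRC-style training instances: one (query, context) pair per entity type, with binary start and end labels over the tokens. It is an illustration only, not the authors' implementation. The query wording, the `QUERIES` table, the helper names `bio_to_spans` and `to_mrc_examples`, and the start/end labeling scheme are all assumptions, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical queries: the abstract does not give the actual query wording.
QUERIES: Dict[str, str] = {
    "Chemical": "Find chemical entities mentioned in the text.",
    "Disease": "Find disease entities mentioned in the text.",
}


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from BIO tags; end is inclusive."""
    spans: List[Tuple[int, int, str]] = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def to_mrc_examples(tokens: List[str], tags: List[str]) -> List[dict]:
    """Build one MRC instance per entity type (assumed start/end labeling)."""
    spans = bio_to_spans(tags)
    examples = []
    for etype, query in QUERIES.items():
        start_labels = [0] * len(tokens)
        end_labels = [0] * len(tokens)
        for s, e, t in spans:
            if t == etype:
                start_labels[s] = 1  # token that opens an entity of this type
                end_labels[e] = 1    # token that closes it
        examples.append({
            "entity_type": etype,
            "query": query,
            "context": tokens,
            "start_labels": start_labels,
            "end_labels": end_labels,
        })
    return examples


if __name__ == "__main__":
    tokens = ["Aspirin", "can", "cause", "gastric", "ulcers", "."]
    tags = ["B-Chemical", "O", "O", "B-Disease", "I-Disease", "O"]
    for example in to_mrc_examples(tokens, tags):
        print(example["entity_type"], example["start_labels"], example["end_labels"])
```

In a setup like this, the query is where prior knowledge about an entity type enters the input, and entity spans are read directly from the predicted start and end positions, which is consistent with the abstract's point that a CRF decoding layer is no longer required.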
