Paper Title

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

Paper Authors

Usman Naseem, Matloob Khushi, Vinay Reddy, Sakthivel Rajendran, Imran Razzak, Jinman Kim

Paper Abstract

In recent years, with the growing volume of biomedical documents and advances in natural language processing algorithms, research on biomedical named entity recognition (BioNER) has grown exponentially. However, NER in the biomedical domain remains challenging because: (i) it is often restricted by the limited amount of training data, (ii) an entity can refer to multiple types and concepts depending on its context, and (iii) the text relies heavily on sub-domain-specific acronyms. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), BioALBERT, an effective domain-specific language model trained on large-scale biomedical corpora and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss used in ALBERT, which focuses on modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter-reduction techniques to lower memory consumption and increase training speed for BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets covering four different entity types. We trained four variants of BioALBERT, which are available to the research community for future research.
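For readers who want to reproduce the kind of workflow the abstract describes, the sketch below shows how an ALBERT-style encoder can be wired up for token classification (BioNER) with Hugging Face Transformers. It is not the authors' released code: the checkpoint name ("albert-base-v2") is a general-domain stand-in for a BioALBERT checkpoint, and the BIO label set is an illustrative assumption; before fine-tuning on one of the benchmark datasets, the classification head is randomly initialized, so the printed tags are only a smoke test of the pipeline.

```python
# Minimal BioNER sketch with an ALBERT encoder (illustrative, not the authors' code).
# Assumptions: "albert-base-v2" stands in for a BioALBERT checkpoint; the Disease
# label set is hypothetical and would be replaced by the dataset's own tag scheme.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Disease", "I-Disease"]  # hypothetical BIO tags for one entity type
model_name = "albert-base-v2"             # swap in a BioALBERT checkpoint if available

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tag a pre-tokenized biomedical sentence (word-level tokens).
words = ["Mutations", "in", "BRCA1", "are", "linked", "to", "breast", "cancer", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, num_subword_tokens, num_labels)

# Map subword-level predictions back to words (take the first subword of each word).
pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = enc.word_ids(batch_index=0)
seen = set()
for sub_idx, w_idx in enumerate(word_ids):
    if w_idx is None or w_idx in seen:
        continue
    seen.add(w_idx)
    print(f"{words[w_idx]:10s} -> {labels[pred_ids[sub_idx]]}")
```

In the paper's setting, this token-classification head would be fine-tuned separately on each of the eight benchmark datasets, with the label set matching that dataset's entity types.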
