A hybrid deep-learning approach for complex biochemical named entity recognition

Paper Title

A hybrid deep-learning approach for complex biochemical named entity recognition

Authors

Liu, Jian; Gao, Lei; Guo, Sujie; Ding, Rui; Huang, Xin; Ye, Long; Meng, Qinghua; Nazari, Asef; Thiruvady, Dhananjay

Abstract

Named entity recognition (NER) of chemicals and drugs is a critical task in information extraction for biochemical research. NER supports text mining of biochemical reactions, including entity relation extraction, attribute extraction, and metabolic response relationship extraction. However, complex naming characteristics in the biomedical field, such as polysemy and special characters, make the NER task very challenging. Here, we propose a hybrid deep learning approach to improve the recognition accuracy of NER. Specifically, our approach applies the Bidirectional Encoder Representations from Transformers (BERT) model to extract the underlying features of the text, learns a representation of the text's context through Bi-directional Long Short-Term Memory (BILSTM), and incorporates the multi-head attention (MHATT) mechanism to extract chapter-level features. In this approach, the MHATT mechanism aims to improve the recognition accuracy of abbreviations and thereby deal efficiently with inconsistent labels across the full text. Moreover, a conditional random field (CRF) is used to label sequence tags, because this probabilistic method does not need strict independence assumptions and can accommodate arbitrary context information. Experimental evaluation on a publicly available dataset shows that the proposed hybrid approach achieves the best recognition performance; in particular, it substantially improves performance in recognizing abbreviations, polysemes, and low-frequency entities compared with state-of-the-art approaches. For instance, compared with the recognition accuracies for low-frequency entities produced by the BILSTM-CRF algorithm, those produced by the hybrid approach on two entity datasets (MULTIPLE and IDENTIFIER) increase by 80% and 21.69%, respectively.
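The role of the CRF layer described in the abstract can be illustrated with a minimal sketch of Viterbi decoding for a linear-chain CRF. This is not the authors' implementation: the tag set (`O`, `B`, `I`), the emission scores (which in the described model would come from the BERT, BILSTM, and MHATT layers), and the transition scores are all hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical tag set and scores, for illustration only (not from the paper).
TAGS = ["O", "B", "I"]

# EMISSIONS[t][j]: score of tag j at token t (assumed upstream network output).
EMISSIONS = [
    [2.0, 0.5, 0.0],
    [0.5, 1.0, 1.5],
    [0.0, 0.5, 2.0],
    [2.0, 0.0, 0.5],
]

# TRANSITIONS[i][j]: score of moving from tag i to tag j.
# O -> I is heavily penalised because an entity cannot start with "I".
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B
    [0.0, -1.0, 0.5],   # from I
]


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    tags: Sequence[str],
) -> Tuple[List[str], float]:
    """Return the highest-scoring tag sequence and its total score."""
    n_tags = len(tags)
    # score[j]: best score of any path that ends in tag j at the current token.
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        new_score = []
        pointers = []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this token.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final tag to recover the full path.
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path], score[best_last]


if __name__ == "__main__":
    best_path, best_score = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
    print(best_path, best_score)
```

With these toy scores, picking the best tag for each token independently would give `O I I O`, which starts an entity with `I`. Decoding over the whole sequence instead yields `O B I O` with a total score of 8.0, showing how transition scores let the CRF use context rather than treating each label as independent.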
