Paper Title

Conceptualized Representation Learning for Chinese Biomedical Text Mining

Authors

Zhang, Ningyu, Jia, Qianghuai, Yin, Kangping, Dong, Liang, Gao, Feng, Hua, Nengwei

Abstract

Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts, as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to learn via language models. For Chinese biomedical text, this is even more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora, and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (\textbf{ChineseBLUE}). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach brings significant gains. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.