Paper Title

BioMegatron: Larger Biomedical Domain Language Model

Authors

Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, Raghav Mani

Abstract

There has been an influx of biomedical domain-specific language models, showing language models pre-trained on biomedical text perform better on biomedical domain benchmarks than those trained on general domain text corpora such as Wikipedia and Books. Yet, most works do not study the factors affecting each domain language application deeply. Additionally, the study of model size on domain-specific models has been mostly missing. We empirically study and evaluate several factors that can affect performance on domain language applications, such as the sub-word vocabulary set, model size, pre-training corpus, and domain transfer. We show consistent improvements on benchmarks with our larger BioMegatron model trained on a larger domain corpus, contributing to our understanding of domain language model applications. We demonstrate noticeable improvements over the previous state-of-the-art (SOTA) on standard biomedical NLP benchmarks of named entity recognition, relation extraction, and question answering. Model checkpoints and code are available at [https://ngc.nvidia.com] and [https://github.com/NVIDIA/NeMo].
