Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP

Paper Title

Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP

Authors

Johann Frei, Frank Kramer

Abstract

Obtaining text datasets with semantic annotations is a laborious process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major interconnected problems, such as a lack of task-matching datasets and of task-specific pre-trained models. In our work, we propose leveraging pretrained language models for training data acquisition in order to obtain datasets large enough to train smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our dataset and our pre-trained models are publicly available at: https://github.com/frankkramer-lab/GPTNERMED
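
As a rough illustration of the data-acquisition idea in the abstract, the sketch below turns one generated sentence with inline entity markup into plain text plus character-offset spans, the form most NER trainers expect. The `[entity](Label)` markup, the label names, and the function `parse_annotated` are assumptions made for this example; the abstract does not specify the authors' actual output format or parsing code.

```python
import re

# Hypothetical inline markup: [entity text](Label).
# This format is an assumption for illustration only, not the authors' format.
TAG = re.compile(r"\[([^\]]+)\]\(([A-Za-z]+)\)")


def parse_annotated(sentence):
    """Strip inline tags; return (plain_text, [(start, end, label), ...])."""
    plain_parts = []
    spans = []
    cursor = 0  # read position in the tagged input
    offset = 0  # write position in the plain output
    for match in TAG.finditer(sentence):
        # Copy the untagged text that precedes this entity.
        before = sentence[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        # Record the entity span relative to the plain text.
        entity, label = match.group(1), match.group(2)
        spans.append((offset, offset + len(entity), label))
        plain_parts.append(entity)
        offset += len(entity)
        cursor = match.end()
    plain_parts.append(sentence[cursor:])
    return "".join(plain_parts), spans


# Example German sentence with made-up labels.
text, spans = parse_annotated(
    "Der Patient erhielt [Ibuprofen](Medication) gegen [Kopfschmerzen](Diagnosis)."
)
for start, end, label in spans:
    print(label, text[start:end])
```

Offsets are computed against the stripped text rather than the tagged input, so the spans stay valid once the markup is removed; sentences whose markup fails to parse would simply yield no spans and could be filtered out before training.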
