Paper Title

A Large-Scale Dataset for Biomedical Keyphrase Generation

Authors

Mael Houbre, Florian Boudin, Beatrice Daille

Abstract

Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and those that do are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for both present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
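
The abstract distinguishes present from absent keyphrases without defining them. In the keyphrase generation literature, a keyphrase is usually called present when it occurs in the source text and absent otherwise. The sketch below illustrates that split under simplifying assumptions: it matches exact lowercase token sequences, whereas published evaluations typically apply stemming first. The function names and the sample sentence are hypothetical and are not taken from the paper or the dataset.

```python
import re
from typing import List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(doc_tokens: List[str], phrase_tokens: List[str]) -> bool:
    """Return True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def split_present_absent(
    document: str, keyphrases: List[str]
) -> Tuple[List[str], List[str]]:
    """Partition keyphrases by whether they appear in the document text."""
    doc_tokens = normalize(document)
    present, absent = [], []
    for kp in keyphrases:
        if contains_sequence(doc_tokens, normalize(kp)):
            present.append(kp)
        else:
            absent.append(kp)
    return present, absent


# Made-up example text and keyphrases, for illustration only.
document = (
    "We introduce a large-scale dataset for keyphrase generation "
    "built from PubMed abstracts."
)
keyphrases = ["keyphrase generation", "PubMed", "natural language processing"]

present, absent = split_present_absent(document, keyphrases)
print("present:", present)
print("absent:", absent)
```

Absent keyphrases are the harder case because a model cannot copy them from the input and has to generate them.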
