Paper Title
Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data
Paper Authors
Paper Abstract
The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories such as BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields such as computer science. Explicit categories could support more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-biomedical domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories for an abstract (BACKGROUND, TECHNIQUE, and OBSERVATION) because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed and then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution for the automatic segmentation of abstracts where labeled data is sparse.
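To make the two-stage transfer-learning recipe concrete, below is a minimal sketch: pretrain a sentence classifier on structured-abstract sentences, then fine-tune it on a small hand-labeled corpus and predict one of the three discourse categories per sentence. The abstract does not specify the paper's actual network, so a simple bag-of-words classifier in PyTorch stands in for it; the toy sentences and all names are illustrative assumptions, not the authors' data or code.

# Minimal transfer-learning sketch (assumption: the paper's exact architecture is not
# given in the abstract; a bag-of-words sentence classifier is used as a stand-in).
import torch
import torch.nn as nn

CATEGORIES = ["BACKGROUND", "TECHNIQUE", "OBSERVATION"]

# Hypothetical toy data standing in for PubMed structured-abstract sentences
# (pretraining) and the small hand-labeled computer science corpus (fine-tuning).
pubmed_sentences = [
    ("diabetes is a growing public health concern", "BACKGROUND"),
    ("we conducted a randomized controlled trial", "TECHNIQUE"),
    ("mortality decreased significantly in the treatment group", "OBSERVATION"),
]
cs_sentences = [
    ("discourse-level search over papers remains difficult", "BACKGROUND"),
    ("we fine-tune a network pretrained on pubmed", "TECHNIQUE"),
    ("we observe 75% accuracy on the test corpus", "OBSERVATION"),
]

# Shared vocabulary over both corpora.
vocab = {w: i for i, w in enumerate(sorted({w for s, _ in pubmed_sentences + cs_sentences
                                            for w in s.split()}))}

def encode(sentence):
    # Bag-of-words vector for one sentence.
    x = torch.zeros(len(vocab))
    for w in sentence.split():
        if w in vocab:
            x[vocab[w]] += 1.0
    return x

def make_batch(pairs):
    xs = torch.stack([encode(s) for s, _ in pairs])
    ys = torch.tensor([CATEGORIES.index(c) for _, c in pairs])
    return xs, ys

model = nn.Sequential(nn.Linear(len(vocab), 32), nn.ReLU(), nn.Linear(32, len(CATEGORIES)))
loss_fn = nn.CrossEntropyLoss()

def train(pairs, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    xs, ys = make_batch(pairs)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(xs), ys).backward()
        opt.step()

# Stage 1: pretrain on the large structured-abstract corpus.
train(pubmed_sentences, lr=1e-2, epochs=50)
# Stage 2: fine-tune on the small hand-labeled corpus with a lower learning rate.
train(cs_sentences, lr=1e-3, epochs=20)

print(CATEGORIES[model(encode("we observe 75% accuracy on the test corpus")).argmax().item()])

The lower learning rate in the second stage reflects the usual fine-tuning practice of adapting pretrained weights without overwriting them; whether the paper freezes any layers or uses a different schedule is not stated in the abstract.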