Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Paper Title

Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Authors

Amanuel Alambo, Cori Lohstroh, Erik Madaus, Swati Padhee, Brandy Foster, Tanvi Banerjee, Krishnaprasad Thirunarayan, Michael Raymer

Abstract

Recent advances in natural language processing have enabled automation of a wide range of tasks, including machine translation, named entity recognition, and sentiment analysis. Automated summarization of documents, or groups of documents, however, has remained elusive, with many efforts limited to extraction of keywords, key phrases, or key sentences. Accurate abstractive summarization has yet to be achieved due to the inherent difficulty of the problem, and limited availability of training data. In this paper, we propose a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study (FoS) in Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2. The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques. Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics (entailment, coherence, conciseness, readability, and grammar). We achieve a kappa score of 0.68 between two co-author linguists who evaluated our results. We plan to publicly share MAG-20, a human-validated gold standard dataset of topic-clustered research articles and their summaries to promote research in abstractive summarization.
