Paper Title

How "Multi" is Multi-Document Summarization?

Authors

Ruben Wolhandler, Arie Cattan, Ori Ernst, Ido Dagan

Abstract

The task of multi-document summarization (MDS) aims at models that, given multiple documents as input, are able to generate a summary that combines dispersed information originally spread across these documents. Accordingly, it is expected that both reference summaries in MDS datasets and system summaries would indeed be based on such dispersed information. In this paper, we argue for quantifying and assessing this expectation. To that end, we propose an automated measure for evaluating the degree to which a summary is "disperse", in the sense of the number of source documents needed to cover its content. We apply our measure to empirically analyze several popular MDS datasets, with respect to their reference summaries as well as the output of state-of-the-art systems. Our results show that certain MDS datasets barely require combining information from multiple documents, and that a single document often covers the full summary content. Overall, we advocate using our metric for assessing and improving the degree to which summarization datasets require combining multi-document information, and similarly the degree to which summarization models actually meet this challenge. Our code is available at https://github.com/ariecattan/multi_mds.
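To make the idea of "the number of source documents needed to cover a summary's content" concrete, the following is a minimal sketch and not the authors' implementation (see their repository for the actual measure). It assumes that summary content can be approximated by lowercased word tokens and that documents are selected greedily; the function names `content_tokens`, `greedy_cover`, and `documents_needed`, as well as the 0.8 coverage threshold, are hypothetical choices made for illustration only.

```python
import re


def content_tokens(text):
    """Lowercased word tokens; a crude stand-in for real content units (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def greedy_cover(summary, documents):
    """Greedily pick the document covering the most still-uncovered summary tokens.

    Returns a list of (document_index, cumulative_coverage) pairs, one per
    selected document, in selection order.
    """
    target = content_tokens(summary)
    # Only tokens that also appear in the summary matter for coverage.
    doc_tokens = [content_tokens(d) & target for d in documents]
    covered = set()
    remaining = set(range(len(documents)))
    steps = []
    while remaining and target:
        # Ties are broken in favor of the lowest document index.
        best = max(sorted(remaining), key=lambda i: len(doc_tokens[i] - covered))
        if not doc_tokens[best] - covered:
            break  # no remaining document adds new coverage
        covered |= doc_tokens[best]
        remaining.remove(best)
        steps.append((best, len(covered) / len(target)))
    return steps


def documents_needed(summary, documents, threshold=0.8):
    """Smallest number of greedily chosen documents reaching the coverage threshold.

    Returns None if the threshold is never reached. The threshold value is an
    illustrative assumption, not a value taken from the paper.
    """
    for count, (_, coverage) in enumerate(greedy_cover(summary, documents), start=1):
        if coverage >= threshold:
            return count
    return None


example_summary = "the storm closed schools and flooded roads"
example_docs = [
    "The storm closed schools.",
    "Roads were flooded and bridges damaged.",
    "Officials held a meeting.",
]
print(documents_needed(example_summary, example_docs))  # prints 2
```

In this toy example no single document covers the summary, so two documents are needed. A dataset in which this count is usually 1 would, in the sense described in the abstract, barely require combining information from multiple documents.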
