Paper Title
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
Paper Authors
Paper Abstract
Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization benchmark data: We hire highly-qualified contractors to read stories and write original summaries from scratch. To amortize reading time, we collect five summaries per document, with the first giving an overview and the subsequent four addressing specific questions. We use this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021). Experiments with state-of-the-art summarization systems show that our dataset is challenging and that existing automatic evaluation metrics are weak indicators of quality.