Paper Title
News Summarization and Evaluation in the Era of GPT-3
Paper Authors
Paper Abstract
The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.
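To make the two prompting setups named in the abstract concrete (generic, task-description-only summarization and keyword-based summarization), here is a minimal sketch. It assumes the legacy openai Python SDK (pre-1.0) and the GPT-3 era model text-davinci-002; the helper name gpt3_summarize, the exact prompt wording, and the decoding parameters are illustrative assumptions, not the paper's published protocol.

from typing import Optional

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

def gpt3_summarize(article: str, keyword: Optional[str] = None) -> str:
    """Summarize a news article with GPT-3 using only a task description."""
    if keyword is None:
        # Generic summarization: the prompt is nothing but a task description.
        prompt = (f"Article: {article}\n\n"
                  "Summarize the above article in three sentences.\n\nSummary:")
    else:
        # Keyword-based summarization: the task description also names a focus keyword.
        prompt = (f"Article: {article}\n\n"
                  f'Summarize the above article in three sentences, focusing on "{keyword}".'
                  "\n\nSummary:")
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=150,
        temperature=0.0,  # greedy decoding, so repeated calls return the same summary
    )
    return response["choices"][0]["text"].strip()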
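The abstract's point about reference-based metrics is easy to see mechanically: a metric such as ROUGE scores a summary by token overlap with a gold reference, so a fluent, factual GPT-3 summary that is phrased differently from the reference is penalized anyway. A small sketch using the rouge-score package (pip install rouge-score); the example strings are invented, not from the paper's data.

from rouge_score import rouge_scorer

# Reference-based evaluation: every score is computed against a gold reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The city council approved the new transit budget on Monday."
gpt3_summary = "On Monday, council members signed off on funding for public transit."

scores = scorer.score(reference, gpt3_summary)
for name, score in scores.items():
    # Each entry is a named tuple with precision, recall, and fmeasure (F1).
    print(f"{name}: F1 = {score.fmeasure:.3f}")

Despite the two summaries conveying the same event, the bigram overlap (rouge2) here is near zero, which is the failure mode the paper's experiments document for GPT-3 outputs.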