Paper Title
News Summarization and Evaluation in the Era of GPT-3
Paper Authors
Paper Abstract
The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.
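To make the two prompting setups named in the abstract concrete (generic, task-description-only summarization and keyword-based summarization), here is a minimal sketch. It assumes the legacy openai Python SDK (pre-1.0) and the GPT-3 era model text-davinci-002; the helper name gpt3_summarize, the exact prompt wording, and the decoding parameters are illustrative assumptions, not the paper's published protocol.

from typing import Optional

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

def gpt3_summarize(article: str, keyword: Optional[str] = None) -> str:
    """Summarize a news article with GPT-3 using only a task description."""
    if keyword is None:
        # Generic summarization: the prompt is nothing but a task description.
        prompt = (f"Article: {article}\n\n"
                  "Summarize the above article in three sentences.\n\nSummary:")
    else:
        # Keyword-based summarization: the task description also names a focus keyword.
        prompt = (f"Article: {article}\n\n"
                  f'Summarize the above article in three sentences, focusing on "{keyword}".'
                  "\n\nSummary:")
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=150,
        temperature=0.0,  # greedy decoding, so repeated calls return the same summary
    )
    return response["choices"][0]["text"].strip()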
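The abstract's point about reference-based metrics is easy to see mechanically: a metric such as ROUGE scores a summary by token overlap with a gold reference, so a fluent, factual GPT-3 summary that is phrased differently from the reference is penalized anyway. A small sketch using the rouge-score package (pip install rouge-score); the example strings are invented, not from the paper's data.

from rouge_score import rouge_scorer

# Reference-based evaluation: every score is computed against a gold reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The city council approved the new transit budget on Monday."
gpt3_summary = "On Monday, council members signed off on funding for public transit."

scores = scorer.score(reference, gpt3_summary)
for name, score in scores.items():
    # Each entry is a named tuple with precision, recall, and fmeasure (F1).
    print(f"{name}: F1 = {score.fmeasure:.3f}")

Despite the two summaries conveying the same event, the bigram overlap (rouge2) here is near zero, which is the failure mode the paper's experiments document for GPT-3 outputs.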