Paper Title
GLGE: A New General Language Generation Evaluation Benchmark
Paper Authors
Paper Abstract
Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning for Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks of differing difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks for comprehensively comparing model performance. To encourage research on pretraining and transfer learning for NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet. (The source code and dataset are publicly available at https://github.com/microsoft/glge.)
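As a rough illustration of the kind of evaluation loop this benchmark implies, the sketch below runs an off-the-shelf pretrained BART baseline on a single summarization-style input and scores the output with ROUGE-L. This is not the official GLGE evaluation code; the checkpoint name, generation hyperparameters, example texts, and metric choice here are all illustrative assumptions (it requires the `transformers` and `rouge-score` packages).

# Illustrative sketch only, not the official GLGE evaluation script.
# Runs a pretrained BART baseline on one summarization-style input and
# scores the output against a reference with ROUGE-L.
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge_score import rouge_scorer

model_name = "facebook/bart-large-cnn"  # assumed checkpoint; GLGE baselines would be fine-tuned per subtask
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

document = ("Multi-task benchmarks such as GLUE and SuperGLUE have driven "
            "great progress in pretraining and transfer learning in NLP.")
reference = "GLUE and SuperGLUE drove progress in NLP pretraining."  # made-up reference, for illustration only

# Generate a hypothesis summary with beam search (hyperparameters are illustrative).
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
hypothesis = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Score with ROUGE-L, a standard n-gram overlap metric for summarization-style subtasks.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(hypothesis)
print(scorer.score(reference, hypothesis))

In the actual benchmark, each of the 24 subtasks pairs its own test set with an appropriate metric; the per-task scripts in the GitHub repository above are the authoritative reference.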