Paper Title
GLGE: A New General Language Generation Evaluation Benchmark
Paper Authors
Paper Abstract
Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning for Natural Language Processing (NLP). These benchmarks mostly focus on a range of Natural Language Understanding (NLU) tasks, without considering Natural Language Generation (NLG) models. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks of differing difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks for comprehensively comparing model performance. To encourage research on pretraining and transfer learning for NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet. (The source code and dataset are publicly available at https://github.com/microsoft/glge.)
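As a rough illustration of the kind of evaluation loop this benchmark implies, the sketch below runs an off-the-shelf pretrained BART baseline on a single summarization-style input and scores the output with ROUGE-L. This is not the official GLGE evaluation code; the checkpoint name, generation hyperparameters, example texts, and metric choice here are all illustrative assumptions (it requires the `transformers` and `rouge-score` packages).

# Illustrative sketch only, not the official GLGE evaluation script.
# Runs a pretrained BART baseline on one summarization-style input and
# scores the output against a reference with ROUGE-L.
from transformers import BartForConditionalGeneration, BartTokenizer
from rouge_score import rouge_scorer

model_name = "facebook/bart-large-cnn"  # assumed checkpoint; GLGE baselines would be fine-tuned per subtask
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

document = ("Multi-task benchmarks such as GLUE and SuperGLUE have driven "
            "great progress in pretraining and transfer learning in NLP.")
reference = "GLUE and SuperGLUE drove progress in NLP pretraining."  # made-up reference, for illustration only

# Generate a hypothesis summary with beam search (hyperparameters are illustrative).
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=60)
hypothesis = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Score with ROUGE-L, a standard n-gram overlap metric for summarization-style subtasks.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(hypothesis)
print(scorer.score(reference, hypothesis))

In the actual benchmark, each of the 24 subtasks pairs its own test set with an appropriate metric; the per-task scripts in the GitHub repository above are the authoritative reference.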