Paper Title
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Paper Authors
Paper Abstract
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data and all pre-trained evaluators are available on our GitHub repository (https://github.com/maszhongming/UniEval).
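To make the Boolean QA reframing concrete, below is a minimal illustrative sketch (not the authors' released code) of how a single seq2seq model can be steered to different evaluation dimensions simply by changing the yes/no question. The prompt wording, the placeholder model name, and the scoring rule P(Yes)/(P(Yes)+P(No)) are assumptions for illustration; the pre-trained evaluators and their exact input format are in the UniEval repository.

```python
# Illustrative sketch only: framing NLG evaluation as Boolean QA with a
# generic T5-style model. Not the official UniEval implementation.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/flan-t5-base"  # placeholder model, not a UniEval checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

def boolean_qa_score(question: str, context: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for a yes/no evaluation question."""
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Score the first decoded token against the "Yes" and "No" tokens.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    return p_yes / (p_yes + p_no)

# One question per dimension guides the same evaluator to different judgments.
summary = "The committee approved the new budget on Tuesday."
source = "On Tuesday the committee voted to approve the city's new annual budget."
for dim, q in {
    "coherence": "Is this a coherent summary of the document?",
    "fluency": "Is this a fluent paragraph?",
}.items():
    score = boolean_qa_score(q, f"summary: {summary} document: {source}")
    print(dim, round(score, 3))
```

Because every dimension reduces to the same yes/no format, the same scoring function serves coherence, fluency, or any new dimension by swapping the question, which is also what makes an intermediate learning phase on related Boolean QA tasks straightforward to add.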