Paper Title

Towards Explainable Evaluation Metrics for Natural Language Generation

Paper Authors

Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Paper Abstract

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are transparent. To foster more widespread acceptance of the novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties of, and propose key goals for, explainable machine translation evaluation metrics. We also provide a synthesizing overview of recent approaches to explainable machine translation metrics and discuss how they relate to those goals and properties. Further, we conduct our own novel experiments, which (among other things) find that current adversarial NLP techniques are unsuitable for automatically identifying the limitations of high-quality black-box evaluation metrics, as they are not meaning-preserving. Finally, we provide a vision of future approaches to explainable evaluation metrics and their evaluation. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, indirectly, also contribute to better and more transparent text generation systems.
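To make the contrast the abstract draws concrete, the sketch below compares sentence-level BLEU with BERTScore on a simple paraphrase. This is an illustration rather than code from the paper, and it assumes the third-party `sacrebleu` and `bert-score` packages are installed:

```python
# Minimal sketch (not from the paper): a lexical overlap metric (BLEU)
# vs. a black-box, embedding-based metric (BERTScore) on a paraphrase.
# Assumes the third-party packages `sacrebleu` and `bert-score` are installed.
import sacrebleu
from bert_score import score

reference = "The cat sat on the mat."
hypothesis = "A cat was sitting on the mat."  # valid paraphrase, little n-gram overlap

# BLEU counts exact n-gram matches, so the paraphrase scores low
# even though its meaning is preserved.
bleu = sacrebleu.sentence_bleu(hypothesis, [reference])
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale

# BERTScore matches contextual embeddings, so it is far more tolerant
# of paraphrase -- but its decision process is opaque to the user.
P, R, F1 = score([hypothesis], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```

The gap between the two scores on such examples is exactly why embedding-based metrics correlate better with human judgments, and also why their opaque scoring motivates the paper's call for explainability.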
