Paper Title

BLEU might be Guilty but References are not Innocent

Authors

Markus Freitag, David Grangier, Isaac Caswell

Abstract

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high quality output, and present an alternative multi-reference formulation that is more effective.
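The abstract contrasts single- and multi-reference BLEU. As background for readers unfamiliar with how multiple references enter the metric, here is a minimal sketch of standard multi-reference BLEU: each hypothesis n-gram count is clipped against the element-wise maximum of that n-gram's count over all references, and the brevity penalty uses the reference length closest to the hypothesis. The add-1 smoothing and function name are illustrative assumptions, not the paper's formulation (the paper in fact argues this pooling does not help for high-quality output and proposes an alternative).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level multi-reference BLEU (add-1 smoothed, for illustration).

    Clipped n-gram counts are taken against the element-wise max over all
    references, which is how standard multi-reference BLEU pools references.
    Assumes a non-empty, whitespace-tokenized hypothesis.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Pool references: max count of each n-gram over all references.
        max_ref = Counter()
        for ref in refs:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Add-1 smoothing so one zero count does not zero the whole score.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty uses the reference length closest to the hypothesis.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to one reference scores 1.0; adding a diverse second reference can only raise the pooled clipped counts, never lower them, which is why the paper's paraphrased references change what the metric rewards.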
