On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Paper Title

On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

Authors

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger

Abstract

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.
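
The second remedy named in the abstract, coupling a semantic-similarity metric with target-side language modeling, can be illustrated with a minimal sketch. The abstract does not say how the two signals are combined, so the weighted sum below, the function names, and the `lm_weight` default are all assumptions made for illustration and are not taken from the paper. The embeddings and token log-probabilities are assumed to come from some cross-lingual encoder and some target-language model, which are not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(src_vec, hyp_vec, hyp_token_logprobs, lm_weight=0.1):
    """Hypothetical reference-free score: semantic similarity plus a fluency term.

    src_vec / hyp_vec: sentence embeddings of the source text and the system
        translation (assumed to live in a shared cross-lingual space).
    hyp_token_logprobs: per-token log-probabilities of the translation under
        a target-language model (assumed to be given).
    lm_weight: illustrative mixing weight, not a value from the paper.
    """
    similarity = cosine(src_vec, hyp_vec)
    if hyp_token_logprobs:
        avg_logprob = sum(hyp_token_logprobs) / len(hyp_token_logprobs)
    else:
        avg_logprob = 0.0
    return similarity + lm_weight * avg_logprob


# Two translations with identical embeddings but different fluency:
src = [1.0, 0.0]
fluent = combined_score(src, [1.0, 0.0], [-1.0, -1.0])
literal = combined_score(src, [1.0, 0.0], [-6.0, -6.0])
```

In this toy setup the similarity term alone cannot separate the two hypotheses, while the language-model term ranks the fluent one higher, which is the kind of penalty for literal "translationese" that the abstract describes.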
