Paper title
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Paper authors
Paper abstract
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.
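The two remedies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the vector spaces are re-aligned or how the language-model term is combined with the similarity term. The sketch assumes a linear map W for re-alignment, cosine similarity as the semantic metric, and a weighted sum with the mean token log-probability of the translation. All names (cosine, apply_linear_map, reference_free_score, lm_weight) are hypothetical, and the embeddings and log-probabilities would come from a real encoder and a real target-side language model.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def apply_linear_map(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def reference_free_score(
    src_vec: Sequence[float],
    hyp_vec: Sequence[float],
    hyp_token_logprobs: Sequence[float],
    W: Optional[Sequence[Sequence[float]]] = None,
    lm_weight: float = 0.1,
) -> float:
    """Score a system translation against its source, without a reference.

    Assumptions (not taken from the abstract):
      - W is a linear re-alignment map applied to the source embedding.
      - The semantic term is the cosine similarity of the two embeddings.
      - The fluency term is the mean token log-probability of the
        translation under a target-side language model.
      - The two terms are combined as a weighted sum.
    """
    if W is not None:
        src_vec = apply_linear_map(W, src_vec)
    semantic = cosine(src_vec, hyp_vec)
    if hyp_token_logprobs:
        fluency = sum(hyp_token_logprobs) / len(hyp_token_logprobs)
    else:
        fluency = 0.0
    return semantic + lm_weight * fluency


if __name__ == "__main__":
    # Toy vectors: the map W swaps the two dimensions, so the source
    # lines up with the translation only after re-alignment.
    swap = [[0.0, 1.0], [1.0, 0.0]]
    print(reference_free_score([1.0, 0.0], [0.0, 1.0], [-1.0, -3.0], W=swap))
```

Under these assumptions, two translations with the same embedding similarity receive different scores if the language model finds one less fluent, which is how the fluency term is meant to penalize "translationese".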