Paper Title

Can Your Context-Aware MT System Pass the DiP Benchmark Tests? : Evaluation Benchmarks for Discourse Phenomena in Machine Translation

Authors

Prathyusha Jwalapuram, Barbara Rychalska, Shafiq Joty, Dominika Basaj

Abstract

Despite increasing instances of machine translation (MT) systems including contextual information, the evidence for translation quality improvement is sparse, especially for discourse phenomena. Popular metrics like BLEU are not expressive or sensitive enough to capture quality improvements or drops that are minor in size but significant in perception. We introduce the first-of-their-kind MT benchmark datasets that aim to track and hail improvements across four main discourse phenomena: anaphora, lexical consistency, coherence and readability, and discourse connective translation. We also introduce evaluation methods for these tasks, and evaluate several baseline MT systems on the curated datasets. Surprisingly, we find that existing context-aware models do not improve discourse-related translations consistently across languages and phenomena.
