Paper title
Evaluating Low-Resource Machine Translation between Chinese and Vietnamese with Back-Translation
Paper authors
Abstract
Back-translation (BT) is widely used and has become a standard data augmentation technique in Neural Machine Translation (NMT). It has been shown to improve translation performance, especially in low-resource scenarios. While most work on BT focuses on European languages, few studies examine languages from other parts of the world. In this paper, we investigate the impact of BT on translation between Chinese and Vietnamese, an extremely low-resource Asian language pair. We evaluate and compare the effects of different sizes of synthetic data on both NMT and Statistical Machine Translation (SMT) models for Chinese-to-Vietnamese and Vietnamese-to-Chinese translation, in both character-based and word-based settings. Some conclusions from previous work are partially confirmed, and we also report additional findings that contribute to a better understanding of BT.