Paper Title
Learning Light-Weight Translation Models from Deep Transformer
Paper Authors
Paper Abstract
Recently, deep models have shown tremendous improvements in neural machine translation (NMT). However, systems of this kind are computationally expensive and memory intensive. In this paper, we take a natural step towards learning strong but light-weight NMT systems. We propose a novel group-permutation based knowledge distillation approach to compress the deep Transformer model into a shallow model. Experimental results on several benchmarks validate the effectiveness of our method: the compressed model is 8× shallower than the deep model, with almost no loss in BLEU. To further enhance the teacher model, we present a Skipping Sub-Layer method that randomly omits sub-layers to introduce perturbation into training, achieving a BLEU score of 30.63 on English-German newstest2014. The code is publicly available at https://github.com/libeineu/GPKD.
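
For illustration, the deep-teacher / shallow-student compression described above can be sketched as a standard word-level knowledge distillation step. This is a minimal, generic sketch in PyTorch, not the paper's group-permutation variant; the names `teacher`, `student`, `batch`, `kd_weight`, and `temperature` are illustrative assumptions, and both models are assumed to return output logits of shape (batch, length, vocab).

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, batch, kd_weight=0.5, temperature=1.0):
    # Generic word-level KD sketch (not the paper's group-permutation method).
    src, tgt_in, tgt_out = batch  # source tokens, shifted target, gold target

    with torch.no_grad():                      # the deep teacher is frozen
        teacher_logits = teacher(src, tgt_in)  # (batch, len, vocab)

    student_logits = student(src, tgt_in)      # shallow student, same vocab

    # Cross-entropy against the gold references (0 assumed to be the pad id).
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        tgt_out.view(-1),
        ignore_index=0,
    )

    # KL divergence between student and teacher output distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return (1.0 - kd_weight) * ce + kd_weight * kd
```

The returned loss can then be back-propagated through the student only, since the teacher's forward pass runs under `torch.no_grad()`.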
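
The Skipping Sub-Layer idea of randomly omitting sub-layers during training can likewise be sketched as below, assuming a pre-norm residual sub-layer and an illustrative `skip_prob` hyperparameter; the paper's exact skipping schedule may differ. When a sub-layer is skipped, the residual connection simply passes the input through unchanged.

```python
import torch
import torch.nn as nn

class SubLayer(nn.Module):
    """A pre-norm residual sub-layer (e.g. self-attention or FFN)
    that is randomly omitted during training."""

    def __init__(self, core: nn.Module, d_model: int, skip_prob: float = 0.2):
        super().__init__()
        self.core = core
        self.norm = nn.LayerNorm(d_model)
        self.skip_prob = skip_prob

    def forward(self, x):
        # At training time, occasionally drop the whole sub-layer to
        # inject perturbation; at inference time, always apply it.
        if self.training and torch.rand(1).item() < self.skip_prob:
            return x
        return x + self.core(self.norm(x))

# Example: wrap a feed-forward block of width d with random skipping.
d = 512
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
layer = SubLayer(ffn, d, skip_prob=0.2)
```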