Paper Title

Local Byte Fusion for Neural Machine Translation

Paper Authors

Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

Abstract

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods i.e. tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity i.e. one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte $n$-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.
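The sub-character granularity described above can be illustrated with a minimal Python sketch. The function name `byte_tokenize` is our own for illustration and is not from the paper; the point is simply that UTF-8 encoding maps one character to one or more byte tokens, so byte sequences grow longer than character sequences, especially for non-Latin scripts:

```python
# Minimal sketch of byte-level tokenization via UTF-8.
# Each character maps to one or more byte tokens (integers 0-255),
# so byte sequences are longer than character sequences for
# non-ASCII scripts.

def byte_tokenize(text: str) -> list[int]:
    """Encode text into a sequence of byte tokens (0-255)."""
    return list(text.encode("utf-8"))

latin = "cat"  # ASCII: 1 byte per character
cjk = "猫"     # CJK: a single character encoded as 3 bytes

print(byte_tokenize(latin))  # 3 characters -> 3 byte tokens
print(byte_tokenize(cjk))    # 1 character  -> 3 byte tokens
```

This length blow-up is why the abstract argues for aggregating local information (byte $n$-grams, word boundaries) in the lower layers of a byte-based model.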
