Paper Title

Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

Authors

Vaidehi Patil, Partha Talukdar, Sunita Sarawagi

Abstract

Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRL). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRL) and LRLs does not provide enough scope of co-embedding the LRL with the HRL, thereby affecting downstream task performance of LRLs. In this paper, we argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation and accuracy. Unlike previous studies that dismissed the importance of token-overlap, we show that in the low-resource related language setting, token overlap matters. Synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.
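To make the idea concrete, below is a minimal, illustrative sketch of a BPE loop whose merge scoring favors symbol pairs that are frequent across *several* languages' corpora rather than in just one, which is the spirit of OBPE. The function name `overlap_bpe` and the geometric-mean scoring are assumptions for illustration; the paper's actual OBPE scoring function differs in its details.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs in a {word-as-symbol-tuple: freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def overlap_bpe(corpora, num_merges, eps=1e-9):
    """Toy overlap-favoring BPE. `corpora` maps language -> {word: freq}.
    Each candidate pair is scored by the product of its per-language counts
    (a monotone transform of their geometric mean), so a merge that is
    moderately frequent in both the HRL and the LRL beats a merge that is
    frequent in only one language. Illustrative scoring, not the paper's."""
    vocabs = {
        lang: {tuple(word): f for word, f in words.items()}
        for lang, words in corpora.items()
    }
    merges = []
    for _ in range(num_merges):
        per_lang = {lang: get_pair_counts(v) for lang, v in vocabs.items()}
        all_pairs = set().union(*[set(p) for p in per_lang.values()])
        if not all_pairs:
            break
        def score(pair):
            prod = 1.0
            for counts in per_lang.values():
                prod *= counts.get(pair, 0) + eps  # eps: tolerate absence
            return prod
        best = max(all_pairs, key=score)
        merges.append(best)
        vocabs = {lang: merge_pair(best, v) for lang, v in vocabs.items()}
    return merges
```

For example, with a hypothetical HRL corpus `{"low": 3, "log": 2}` and LRL corpus `{"low": 2}`, the first merge chosen is `('l', 'o')`: it occurs in both languages, so its cross-lingual score (10) beats `('o', 'w')` (6), which is what drives the increased token sharing between HRL and LRL vocabularies.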
