Paper Title

Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

Authors

Vaidehi Patil, Partha Talukdar, Sunita Sarawagi

Abstract

Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRL). However, due to limited model capacity, the large difference in the sizes of available monolingual corpora between high web-resource languages (HRL) and LRLs does not provide enough scope of co-embedding the LRL with the HRL, thereby affecting downstream task performance of LRLs. In this paper, we argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This results in improved zero-shot transfer from related HRLs to LRLs without reducing HRL representation and accuracy. Unlike previous studies that dismissed the importance of token-overlap, we show that in the low-resource related language setting, token overlap matters. Synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.
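To make the idea concrete, below is a minimal, illustrative sketch of a BPE loop whose merge scoring favors symbol pairs that are frequent across *several* languages' corpora rather than in just one, which is the spirit of OBPE. The function name `overlap_bpe` and the geometric-mean scoring are assumptions for illustration; the paper's actual OBPE scoring function differs in its details.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs in a {word-as-symbol-tuple: freq} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def overlap_bpe(corpora, num_merges, eps=1e-9):
    """Toy overlap-favoring BPE. `corpora` maps language -> {word: freq}.
    Each candidate pair is scored by the product of its per-language counts
    (a monotone transform of their geometric mean), so a merge that is
    moderately frequent in both the HRL and the LRL beats a merge that is
    frequent in only one language. Illustrative scoring, not the paper's."""
    vocabs = {
        lang: {tuple(word): f for word, f in words.items()}
        for lang, words in corpora.items()
    }
    merges = []
    for _ in range(num_merges):
        per_lang = {lang: get_pair_counts(v) for lang, v in vocabs.items()}
        all_pairs = set().union(*[set(p) for p in per_lang.values()])
        if not all_pairs:
            break
        def score(pair):
            prod = 1.0
            for counts in per_lang.values():
                prod *= counts.get(pair, 0) + eps  # eps: tolerate absence
            return prod
        best = max(all_pairs, key=score)
        merges.append(best)
        vocabs = {lang: merge_pair(best, v) for lang, v in vocabs.items()}
    return merges
```

For example, with a hypothetical HRL corpus `{"low": 3, "log": 2}` and LRL corpus `{"low": 2}`, the first merge chosen is `('l', 'o')`: it occurs in both languages, so its cross-lingual score (10) beats `('o', 'w')` (6), which is what drives the increased token sharing between HRL and LRL vocabularies.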
