Paper Title

Auto-MLM: Improved Contrastive Learning for Self-supervised Multi-lingual Knowledge Retrieval

Paper Authors

Wenshen Xu, Mieradilijiang Maimaiti, Yuanhang Zheng, Xin Tang, Ji Zhang

Paper Abstract

Contrastive learning (CL) has become a ubiquitous approach for several natural language processing (NLP) downstream tasks, especially for question answering (QA). However, the major challenge of how to efficiently train the knowledge retrieval model in an unsupervised manner remains unresolved. Recently, the commonly used methods consist of CL and the masked language model (MLM). Unexpectedly, MLM ignores sentence-level training, and CL likewise neglects extracting internal information from the query. To address the problem that CL can hardly obtain internal information from the original query, we introduce a joint training method that combines CL and Auto-MLM for self-supervised multi-lingual knowledge retrieval. First, we acquire a fixed-dimensional sentence vector. Then, we mask some words in the original sentences with a random strategy. Finally, we generate a new token representation for predicting the masked tokens. Experimental results show that our proposed approach consistently outperforms all previous SOTA methods on both the AliExpress & LAZADA service corpus and openly available corpora in 8 languages.
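The abstract describes a three-step joint objective: pool a fixed-dimensional sentence vector for contrastive training, randomly mask tokens in the original sentence, and predict the masked tokens from new token representations. The sketch below shows how such a joint CL + masked-token-prediction loss could be wired up in PyTorch. The toy Transformer encoder, mean pooling, 15% masking ratio, temperature of 0.05, and equal loss weighting are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of a joint CL + Auto-MLM-style objective (illustrative only).
# The toy encoder, masking ratio, temperature, and loss weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, HIDDEN, MASK_ID, PAD_ID = 1000, 64, 1, 0

class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, HIDDEN, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, ids):
        h = self.encoder(self.emb(ids))      # token-level representations
        sent = h.mean(dim=1)                 # fixed-dimensional sentence vector
        return h, sent

def random_mask(ids, ratio=0.15):
    """Mask a random subset of non-pad tokens; return corrupted ids and labels."""
    labels = ids.clone()
    chosen = (torch.rand_like(ids, dtype=torch.float) < ratio) & ids.ne(PAD_ID)
    corrupted = ids.masked_fill(chosen, MASK_ID)
    labels[~chosen] = -100                   # ignored by cross_entropy by default
    return corrupted, labels

def joint_loss(model, ids_a, ids_b, temperature=0.05):
    """CL loss over sentence vectors of two views + MLM loss on masked tokens."""
    _, za = model(ids_a)
    _, zb = model(ids_b)
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.t() / temperature       # in-batch negatives
    targets = torch.arange(ids_a.size(0))
    cl = F.cross_entropy(logits, targets)

    corrupted, labels = random_mask(ids_a)
    h, _ = model(corrupted)
    mlm = F.cross_entropy(model.mlm_head(h).view(-1, VOCAB_SIZE), labels.view(-1))
    return cl + mlm                          # equal weighting assumed

if __name__ == "__main__":
    model = ToyEncoder()
    batch = torch.randint(2, VOCAB_SIZE, (8, 16))
    loss = joint_loss(model, batch, batch)   # identical views as a trivial example
    loss.backward()
    print(float(loss))
```

In practice the two views passed to the contrastive term would come from the paper's augmentation or dual-query setup rather than identical copies; the snippet only illustrates how the sentence-level and token-level losses can be trained jointly.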
