Paper Title


LAReQA: Language-agnostic answer retrieval from a multilingual pool

Authors

Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, Yinfei Yang

Abstract


We present LAReQA, a challenging new benchmark for language-agnostic answer retrieval from a multilingual candidate pool. Unlike previous cross-lingual tasks, LAReQA tests for "strong" cross-lingual alignment, requiring semantically related cross-language pairs to be closer in representation space than unrelated same-language pairs. Building on multilingual BERT (mBERT), we study different strategies for achieving strong alignment. We find that augmenting training data via machine translation is effective, and improves significantly over using mBERT out-of-the-box. Interestingly, the embedding baseline that performs the best on LAReQA falls short of competing baselines on zero-shot variants of our task that only target "weak" alignment. This finding underscores our claim that language-agnostic retrieval is a substantively new kind of cross-lingual evaluation.
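The "strong" alignment criterion from the abstract can be illustrated with a minimal sketch: a semantically related cross-language question–answer pair should score higher (e.g. by cosine similarity) than an unrelated pair in the question's own language. The embeddings below are toy values for illustration only, not actual mBERT outputs, and the variable names are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (hypothetical values): an English question, its
# related answer in another language, and an unrelated answer that
# happens to be in the same language as the question.
question_en  = np.array([1.0, 0.1, 0.0])
answer_de    = np.array([0.9, 0.2, 0.1])  # related, cross-language
unrelated_en = np.array([0.0, 1.0, 0.3])  # unrelated, same language

# "Strong" alignment: the related cross-language pair must be closer
# in representation space than the unrelated same-language pair.
strongly_aligned = (cosine_sim(question_en, answer_de)
                    > cosine_sim(question_en, unrelated_en))
print(strongly_aligned)  # → True for these toy vectors
```

"Weak" alignment, by contrast, would only require the related answer to beat unrelated answers within each language separately, which this check does not test.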
