Paper Title
The Secret is in the Spectra: Predicting Cross-lingual Task Performance with Spectral Similarity Measures
Authors
Abstract
Performance in cross-lingual NLP tasks is impacted by the (dis)similarity of languages at hand: e.g., previous work has suggested there is a connection between the expected success of bilingual lexicon induction (BLI) and the assumption of (approximate) isomorphism between monolingual embedding spaces. In this work we present a large-scale study focused on the correlations between monolingual embedding space similarity and task performance, covering thousands of language pairs and four different tasks: BLI, parsing, POS tagging and MT. We hypothesize that statistics of the spectrum of each monolingual embedding space indicate how well they can be aligned. We then introduce several isomorphism measures between two embedding spaces, based on the relevant statistics of their individual spectra. We empirically show that 1) language similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks, and 2) our spectral-based measures consistently outperform previous standard isomorphism measures, while being computationally more tractable and easier to interpret. Finally, our measures capture complementary information to typologically driven language distance measures, and the combination of measures from the two families yields even higher task performance correlations.
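To make the idea of comparing embedding spaces through their spectra concrete, here is a minimal sketch. The abstract does not define the measures, so everything below is an assumption for illustration: the function names (`singular_values`, `log_spectrum_gap`, `condition_number`), the choice of a squared log-singular-value gap as the dissimilarity statistic, and the random toy matrices standing in for real monolingual embeddings. It is not presented as the authors' implementation.

```python
import numpy as np


def singular_values(emb):
    """Singular values of an (n_words, dim) embedding matrix, largest first."""
    return np.linalg.svd(np.asarray(emb, dtype=float), compute_uv=False)


def log_spectrum_gap(emb_x, emb_y, eps=1e-12):
    """Hypothetical dissimilarity score: sum of squared differences between
    the log singular values of two embedding matrices (lower = more similar).
    Assumed for illustration; the abstract does not specify this formula."""
    sx = singular_values(emb_x)
    sy = singular_values(emb_y)
    k = min(len(sx), len(sy))  # compare only the shared leading part of the spectra
    diff = np.log(sx[:k] + eps) - np.log(sy[:k] + eps)
    return float(np.sum(diff ** 2))


def condition_number(emb, eps=1e-12):
    """Ratio of the largest to the smallest singular value of one space."""
    s = singular_values(emb)
    return float(s[0] / (s[-1] + eps))


# Toy data standing in for monolingual embeddings (assumption: random matrices).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))

# An orthogonal map leaves the spectrum unchanged, so the gap should be ~0.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
x_rot = x @ q

# Rescaling the dimensions unevenly changes the spectrum, so the gap grows.
y = x * np.linspace(1.0, 3.0, 8)

gap_rot = log_spectrum_gap(x, x_rot)
gap_scaled = log_spectrum_gap(x, y)
print(gap_rot, gap_scaled, condition_number(x))
```

The rotated copy is perfectly alignable with the original by an orthogonal map and gets a near-zero gap, while the unevenly rescaled copy does not. This mirrors, in a simplified way, the hypothesis that spectral statistics indicate how well two spaces can be aligned.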