Paper Title

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Authors

Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot

Abstract

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
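
The abstract notes that OSCAR is extracted from Common Crawl via language classification, filtering and cleaning. As a rough illustration of that filtering step (not the authors' exact pipeline), the sketch below keeps only lines that a fastText language-identification model assigns to a target language with high confidence; the model file, confidence threshold, file paths, and the choice of Bulgarian ("bg") as the example language code are illustrative assumptions.

```python
# Sketch: language-classification filtering of raw Common Crawl text, in the
# spirit of the OSCAR extraction described in the abstract. Paths, threshold
# and the example language code are assumptions, not the paper's configuration.
import fasttext

# lid.176.bin is fastText's publicly released language-identification model.
lid_model = fasttext.load_model("lid.176.bin")

def keep_line(line: str, target_lang: str = "bg", min_conf: float = 0.8) -> bool:
    """Keep a line only if the classifier labels it as `target_lang`
    with probability at least `min_conf`."""
    text = line.strip()
    if not text:
        return False
    labels, probs = lid_model.predict(text)
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_conf

# Example: stream a raw text dump and keep only lines in the target language.
with open("commoncrawl_chunk.txt", encoding="utf-8") as src, \
     open("oscar_bg_filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_line(line, target_lang="bg"):
            dst.write(line)
```

The abstract also mentions cleaning; in the corpus construction that step follows language filtering, and the resulting monolingual text is what the ELMo models are trained on.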
