Learning and Evaluating Emotion Lexicons for 91 Languages

Paper Title

Learning and Evaluating Emotion Lexicons for 91 Languages

Authors

Sven Buechel, Susanna Rücker, Udo Hahn

Abstract

Emotion lexicons describe the affective meaning of words and thus constitute a centerpiece for advanced sentiment and emotion analysis. Yet, manually curated lexicons are only available for a handful of languages, leaving most languages of the world without such a precious resource for downstream applications. Even worse, their coverage is often limited both in terms of the lexical units they contain and the emotional variables they feature. In order to break this bottleneck, we here introduce a methodology for creating almost arbitrarily large emotion lexicons for any target language. Our approach requires nothing but a source language emotion lexicon, a bilingual word translation model, and a target language embedding model. Fulfilling these requirements for 91 languages, we are able to generate representationally rich high-coverage lexicons comprising eight emotional variables with more than 100k lexical entries each. We evaluated the automatically generated lexicons against human judgment from 26 datasets, spanning 12 typologically diverse languages, and found that our approach produces results in line with state-of-the-art monolingual approaches to lexicon creation and even surpasses human reliability for some languages and variables. Code and data are available at https://github.com/JULIELab/MEmoLon, archived under DOI https://doi.org/10.5281/zenodo.3779901.
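The three ingredients named in the abstract (a source-language emotion lexicon, a word translation model, and a target-language embedding model) can be combined roughly as in the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which prediction model is used, so a closed-form ridge regression stands in for it, and the names `build_target_lexicon`, `translate`, and `ridge` are hypothetical.

```python
import numpy as np


def build_target_lexicon(source_lexicon, translate, target_embeddings, ridge=1.0):
    """Sketch: induce emotion scores for every word in a target vocabulary.

    source_lexicon:    dict mapping source word -> list of emotion scores
    translate:         callable mapping a source word to a target word
                       (hypothetical stand-in for a translation model)
    target_embeddings: dict mapping target word -> embedding vector
    ridge:             regularization strength (assumed stand-in regressor)
    """
    # Step 1: translate source entries and keep those with a target embedding.
    features, labels = [], []
    for word, scores in source_lexicon.items():
        target_word = translate(word)
        if target_word in target_embeddings:
            features.append(target_embeddings[target_word])
            labels.append(scores)
    X = np.asarray(features, dtype=float)
    Y = np.asarray(labels, dtype=float)

    # Step 2: fit ridge regression from embeddings to emotion scores.
    # A constant column is appended so the model can learn an offset.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    W = np.linalg.solve(A, Xb.T @ Y)

    # Step 3: predict scores for every word in the embedding vocabulary,
    # including words that never appeared as a translation.
    words = list(target_embeddings)
    E = np.asarray([target_embeddings[w] for w in words], dtype=float)
    Eb = np.hstack([E, np.ones((E.shape[0], 1))])
    predictions = Eb @ W
    return {w: predictions[i] for i, w in enumerate(words)}
```

Because the last step scores the full embedding vocabulary, the size of the resulting lexicon is bounded by the embedding model rather than by the source lexicon, which is what makes the lexicons described above "almost arbitrarily large".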
