Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

Paper title

Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures

Authors

Alexander Gutkin, Tatiana Merkulova, Martin Jansche

Abstract

The use of linguistic typological resources in natural language processing has been steadily gaining popularity. It has been observed that the use of typological information, often combined with distributed language representations, leads to significantly more powerful models. While linguistic typology representations from various resources have mostly been used for conditioning the models, relatively little attention has been paid to predicting the features in these resources from the input data. In this paper we investigate whether the various linguistic features from the World Atlas of Language Structures (WALS) can be reliably inferred from multilingual text. Such a predictor can be used to infer structural features for a language never observed in the training data. We frame this task as a multi-label classification problem: predicting a set of non-mutually exclusive and extremely sparse multi-valued labels (WALS features). We construct a recurrent neural network predictor based on byte embeddings and convolutional layers and test its performance on 556 languages, providing analysis for various linguistic types, macro-areas, language families and individual features. We show that some features from various linguistic types can be predicted reliably.
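The task framing in the abstract (byte-level input, many multi-valued features, most of them unattested for any given language) can be illustrated with a small sketch. This is not the authors' implementation: the feature names, value counts, and the choice of a masked cross-entropy loss are assumptions made here purely for illustration, and the abstract does not specify the actual network or training objective.

```python
import math

# Hypothetical feature inventory: feature name -> number of possible values.
# Names and sizes are illustrative assumptions, not taken from the paper.
FEATURES = {"word_order": 7, "tone": 3, "gender_count": 5}


def text_to_bytes(text):
    """Encode text as UTF-8 byte ids (0..255), the kind of input a
    byte-embedding layer would consume."""
    return list(text.encode("utf-8"))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_loss(logits_per_feature, labels):
    """Mean cross-entropy over the features that are attested.

    labels maps feature name -> value index, or None when the feature is
    missing for this language. Missing features contribute nothing, which
    is one simple way to cope with extremely sparse labels.
    """
    losses = []
    for name, logits in logits_per_feature.items():
        target = labels.get(name)
        if target is None:
            continue  # unattested feature: skip it
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses) if losses else 0.0


# Uniform (all-zero) logits stand in for the output of an untrained model.
uniform_logits = {name: [0.0] * size for name, size in FEATURES.items()}
sparse_labels = {"word_order": 2, "tone": None, "gender_count": None}
loss = masked_loss(uniform_logits, sparse_labels)
```

With uniform logits and only `word_order` attested, the loss reduces to the log of that feature's value count, so the unattested features neither help nor penalize the model.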
