Paper Title

Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language

Authors

Mounika Marreddy, Subba Reddy Oota, Lakshmi Sireesha Vakada, Venkata Charan Chinni, Radhika Mamidi

Abstract

Graph Convolutional Networks (GCNs) have achieved state-of-the-art results on single text classification tasks such as sentiment analysis and emotion detection. However, these results are reported on resource-rich languages like English. Applying GCNs to multi-task text classification remains an unexplored area. Moreover, training a GCN or adapting an English GCN for Indian languages is often limited by data availability and by rich morphological variation as well as syntactic and semantic differences. In this paper, we study the use of GCNs for the Telugu language in single- and multi-task settings for four natural language processing (NLP) tasks, viz. sentiment analysis (SA), emotion identification (EI), hate-speech detection (HS), and sarcasm detection (SAR). To evaluate GCN performance on Telugu, one of the Indian languages, we analyze GCN-based models through extensive experiments on the four downstream tasks. In addition, we create an annotated Telugu dataset, TEL-NLP, for the four NLP tasks. Further, we propose a supervised graph reconstruction method, Multi-Task Text GCN (MT-Text GCN), for Telugu that simultaneously (i) learns low-dimensional word and sentence graph embeddings via word-sentence graph reconstruction using a graph autoencoder (GAE) and (ii) performs multi-task text classification using these latent sentence graph embeddings. We argue that our proposed MT-Text GCN achieves significant improvements on TEL-NLP over existing Telugu pretrained word embeddings and over the multilingual pretrained Transformer models mBERT and XLM-R. On TEL-NLP, we achieve high F1-scores on the four NLP tasks: SA (0.84), EI (0.55), HS (0.83), and SAR (0.66). Finally, we present quantitative and qualitative analyses of our model on the four NLP tasks in Telugu.
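The two ingredients the abstract names, GCN message passing over a word-sentence graph and GAE-style reconstruction of that graph from latent node embeddings, can be sketched in a few lines of NumPy. This is a minimal illustrative sketch of the general technique, not the authors' implementation: the toy 3-word/2-sentence graph, the layer sizes, and the inner-product decoder are assumptions chosen only to show the mechanics.

```python
import numpy as np

def normalize_adj(A):
    # Standard GCN preprocessing: symmetric normalization of A + I,
    # i.e. D^{-1/2} (A + I) D^{-1/2}.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # One GCN propagation step: ReLU(A_norm @ H @ W).
    return np.maximum(A_norm @ H @ W, 0.0)

def gae_reconstruct(Z):
    # Inner-product decoder of a graph autoencoder:
    # predicted edge probabilities sigmoid(Z Z^T).
    return 1.0 / (1.0 + np.exp(-Z @ Z.T))

# Toy heterogeneous word-sentence graph: nodes 0-2 are words,
# nodes 3-4 are sentences; edges link words to the sentences
# that contain them (a hypothetical example, not TEL-NLP data).
A = np.array([
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)
X = np.eye(5)                       # one-hot node features
rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))    # untrained layer weights
W2 = rng.standard_normal((4, 2))

A_norm = normalize_adj(A)
Z = gcn_layer(A_norm, gcn_layer(A_norm, X, W1), W2)  # latent embeddings
A_pred = gae_reconstruct(Z)         # reconstructed adjacency
print(Z.shape, A_pred.shape)        # (5, 2) (5, 5)
```

In a trained model, the reconstruction loss between `A_pred` and `A` would drive the encoder, and the sentence rows of `Z` would feed the shared multi-task classification heads; here the weights are random, so only the shapes and the propagation/decoding mechanics are meaningful.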
