Paper Title
A Comprehensive Understanding of Code-mixed Language Semantics using Hierarchical Transformer
Paper Authors
Paper Abstract
Being a popular mode of text-based communication in multilingual communities, code-mixing in online social media has become an important subject of study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character-, subword-, and word-level embeddings, which aid in learning meaningful correlations. In this paper, we explore a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages. HIT consists of multi-headed self-attention and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method on 9 NLP tasks across 17 datasets spanning 6 Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on all tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling-based pre-training, zero-shot learning, and transfer learning approaches. Our empirical results show that the pre-training objectives significantly improve performance on downstream tasks.
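To make the architectural description concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: a hierarchical encoder that pools character-level representations into word-level ones, and an attention block that fuses multi-headed self-attention with an outer-product-style attention term. All class names, dimensions, the mean-pooling scheme, the learned-gate fusion, and the exact form of the outer product attention (pairwise query-key outer products projected to scalar scores) are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OuterProductAttention(nn.Module):
    """Illustrative outer product attention (an assumed formulation):
    each query-key pair interacts via an outer product, which is
    projected down to a scalar attention score."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Maps each d_model x d_model outer product to one score.
        self.score = nn.Linear(d_model * d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Pairwise outer products: (batch, seq_q, seq_k, d, d).
        outer = torch.einsum("bqd,bke->bqkde", q, k)
        scores = self.score(outer.flatten(-2)).squeeze(-1)  # (batch, seq_q, seq_k)
        attn = F.softmax(scores / q.size(-1) ** 0.5, dim=-1)
        return attn @ v


class FusedAttentionBlock(nn.Module):
    """Fuses standard multi-headed self-attention with the outer
    product attention above via a learned scalar gate (the fusion
    scheme here is an assumption)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.opa = OuterProductAttention(d_model)
        self.gate = nn.Parameter(torch.tensor(0.5))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sa, _ = self.self_attn(x, x, x)
        fused = self.gate * sa + (1 - self.gate) * self.opa(x)
        return self.norm(x + fused)  # residual connection + layer norm


class HierarchicalEncoder(nn.Module):
    """One plausible reading of the hierarchical design: a character-level
    block whose outputs are pooled into word vectors, followed by a
    word-level block over the pooled representations."""

    def __init__(self, n_chars: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model, padding_idx=0)
        self.char_block = FusedAttentionBlock(d_model, n_heads)
        self.word_block = FusedAttentionBlock(d_model, n_heads)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, n_words, n_chars_per_word)
        b, w, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * w, c, -1)
        chars = self.char_block(chars)                 # contextualize characters
        words = chars.mean(dim=1).view(b, w, -1)       # pool chars -> word vectors
        return self.word_block(words)                  # (batch, n_words, d_model)


# Usage with a toy batch: 2 sentences, 8 words, 12 characters per word.
model = HierarchicalEncoder(n_chars=100)
ids = torch.randint(1, 100, (2, 8, 12))
print(model(ids).shape)  # torch.Size([2, 8, 64])
```

Note that the pairwise outer products are memory-heavy (seq^2 x d^2 per batch item), so this sketch is only practical at the small dimensions shown; a production variant would need a more economical parameterization.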