Paper Title

Harnessing Code Switching to Transcend the Linguistic Barrier

Authors

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Jaime G. Carbonell

Abstract

Code mixing (or code switching) is a common phenomenon in social media content generated by a linguistically diverse user base. Studies show that in the Indian subcontinent, a substantial fraction of social media posts exhibit code switching. While the difficulties that code-mixed documents pose for downstream analyses are well understood, lending visibility to code-mixed documents may, in certain scenarios, have utility that has previously been overlooked. For instance, a document written in a mixture of languages can be partially accessible to a wider audience; this could be particularly useful if a considerable fraction of the audience lacks fluency in one of the component languages. In this paper, we provide a systematic approach to sampling code-mixed documents, leveraging a polyglot-embedding-based method that requires minimal supervision. In the context of the 2019 India-Pakistan conflict triggered by the Pulwama terror attack, we demonstrate an untapped potential of harnessing code mixing for human well-being: starting from an existing hostility-diffusing hope speech classifier trained solely on English documents, code-mixed documents are used as a bridge to retrieve hope speech content written in a low-resource but widely used language, Romanized Hindi. Our proposed pipeline requires minimal supervision and holds promise for substantially reducing web moderation efforts.
