Paper Title

Learnings from Technological Interventions in a Low Resource Language: Enhancing Information Access in Gondi

Authors

Mehta, Devansh; Diddee, Harshita; Saxena, Ananya; Shukla, Anurag; Santy, Sebastin; Mothilal, Ramaravind Kommiya; Srivastava, Brij Mohan Lal; Sharma, Alok; Prasad, Vishnu; U, Venkanna; Bali, Kalika

Abstract

The primary obstacle to developing technologies for low-resource languages is the lack of representative, usable data. In this paper, we report the deployment of technology-driven data collection methods for creating a corpus of more than 60,000 translations from Hindi to Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process, we help expand information access in Gondi along two dimensions: (a) creating linguistic resources that the community can use, such as a dictionary, children's stories, Gondi translations from multiple sources, and an Interactive Voice Response (IVR) based mass awareness platform; and (b) enabling the language's use in the digital domain by developing a Hindi-Gondi machine translation model, which is compressed by nearly 4 times to enable its deployment on low-resource edge devices and in areas with little to no internet connectivity. We also present preliminary evaluations of using the developed machine translation model to assist volunteers who are collecting more data for the target language. Through these interventions, we not only created a refined and evaluated corpus of 26,240 Hindi-Gondi translations, which was used to build the translation model, but also engaged nearly 850 community members who can help take Gondi onto the internet.
