mDIA: A Benchmark for Multilingual Dialogue Generation in 46 Languages

Paper Title

mDIA: A Benchmark for Multilingual Dialogue Generation in 46 Languages

Authors

Qingyu Zhang, Xiaoyu Shen, Ernie Chang, Jidong Ge, Pengke Chen

Abstract

Owing to the lack of corpora for low-resource languages, current works on dialogue generation have mainly focused on English. In this paper, we present mDIA, the first large-scale multilingual benchmark for dialogue generation across low- to high-resource languages. It covers real-life conversations in 46 languages across 19 language families. We present baseline results obtained by fine-tuning the multilingual, non-dialogue-focused pre-trained model mT5 as well as English-centric, dialogue-focused pre-trained chatbot DialoGPT. The results show that mT5-based models perform better on sacreBLEU and BertScore but worse on diversity. Even though promising results are found in few-shot and zero-shot scenarios, there is a large gap between the generation quality in English and other languages. We hope that the release of mDIA could encourage more works on multilingual dialogue generation to promote language diversity.
