Paper title
mDIA: A Benchmark for Multilingual Dialogue Generation in 46 Languages
Authors
Abstract
Owing to the lack of corpora for low-resource languages, current work on dialogue generation has mainly focused on English. In this paper, we present mDIA, the first large-scale multilingual benchmark for dialogue generation across low- to high-resource languages. It covers real-life conversations in 46 languages across 19 language families. We present baseline results obtained by fine-tuning the multilingual, non-dialogue-focused pre-trained model mT5 as well as the English-centric, dialogue-focused pre-trained chatbot DialoGPT. The results show that mT5-based models perform better on sacreBLEU and BERTScore but worse on diversity. Even though promising results are found in few-shot and zero-shot scenarios, there is a large gap between the generation quality in English and in other languages. We hope that the release of mDIA will encourage more work on multilingual dialogue generation to promote language diversity.
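The abstract reports that mT5-based models score worse on diversity but does not say how diversity is measured. As an illustration only, the sketch below computes Distinct-n, a commonly used diversity measure for generated dialogue: the ratio of unique n-grams to total n-grams across all responses. The choice of Distinct-n, the function name `distinct_n`, and the whitespace tokenization are assumptions made here, not details taken from the paper.

```python
from collections import Counter
from typing import Iterable, List


def distinct_n(responses: Iterable[str], n: int = 2) -> float:
    """Return the ratio of unique n-grams to total n-grams over all responses.

    Assumption: tokens are obtained by whitespace splitting, which is only
    a rough stand-in for a proper multilingual tokenizer.
    """
    ngrams: Counter = Counter()
    for response in responses:
        tokens: List[str] = response.split()
        # Slide a window of size n over the token sequence.
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # With no n-grams at all, report zero diversity instead of dividing by zero.
    return len(ngrams) / total if total else 0.0


if __name__ == "__main__":
    generated = ["i do not know", "i do not think so", "that sounds great"]
    print(f"distinct-1: {distinct_n(generated, 1):.3f}")
    print(f"distinct-2: {distinct_n(generated, 2):.3f}")
```

A higher value means fewer repeated n-grams across responses. For a benchmark spanning 46 languages, whitespace splitting would likely need to be replaced with a language-appropriate tokenizer, since some languages do not separate words with spaces.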