Paper Title
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization
Paper Authors
Paper Abstract
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents from two subsets (i.e., SAMSum and MediaSum) and 112k+ annotated summaries in different target languages. Based on ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART-50 (a multilingual BART) via further pre-training. The multiple objectives used in the further pre-training stage help the model capture the structural characteristics and important content of dialogues, as well as the transformation from the source language to the target language. Experimental results show that mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face on this task and point out multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.
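To make the contrast between the two paradigms concrete, below is a minimal sketch of the end-to-end setup, assuming the Hugging Face transformers library and the public mBART-50 checkpoint as a stand-in for mDialBART (the actual mDialBART weights and further pre-training objectives are in the authors' repository). A single multilingual seq2seq model reads the source-language dialogue and decodes output directly in the target language, whereas a pipeline model would first summarize and then translate (or vice versa).

```python
# A minimal sketch of the end-to-end paradigm, using the public mBART-50
# many-to-many translation checkpoint as a stand-in for mDialBART. Without
# fine-tuning on ClidSum this checkpoint translates rather than summarizes;
# the point is only to show how one multilingual seq2seq model maps a
# source-language dialogue directly to target-language text in a single step.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"  # stand-in, not mDialBART
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

dialogue = "Amanda: I baked cookies. Do you want some? Jerry: Sure!"
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=512)

# Forcing the first decoder token to a target-language code makes the model
# generate in that language in one step, with no separate translation stage
# (German here; the target-language choice is illustrative).
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"],
    num_beams=4,
    max_length=64,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

In the paper's setting, further pre-training on dialogue-specific and cross-lingual objectives (yielding mDialBART) is what adapts such a model from generic multilingual generation to cross-lingual dialogue summarization.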