Paper title
EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain
Authors
Abstract
Existing summarization datasets have two main drawbacks: (1) they tend to focus on overly exposed domains, such as news articles or wiki-like texts, and (2) they are primarily monolingual, with few multilingual datasets available. In this work, we propose a novel dataset, called EUR-Lex-Sum, based on manually curated document summaries of legal acts from the European Union law platform (EUR-Lex). Documents and their respective summaries exist as cross-lingual paragraph-aligned data in several of the 24 official European languages, enabling access to various cross-lingual and lower-resourced summarization setups. We obtain up to 1,500 document/summary pairs per language, including a subset of 375 cross-lingually aligned legal acts with texts available in all 24 languages. We detail the data acquisition process and compare key characteristics of the resource to existing summarization resources. In particular, we illustrate challenging sub-problems and open questions on the dataset that could help facilitate future research on domain-specific cross-lingual summarization. Limited by the extreme length and language diversity of samples, we further conduct experiments with suitable extractive monolingual and cross-lingual baselines for future work. Code for the extraction as well as access to our data and baselines is available online at: https://github.com/achouhan93/eur-lex-sum.