An Overview of Indian Language Datasets used for Text Summarization

Title

An Overview of Indian Language Datasets used for Text Summarization

Authors

Sinha, Shagun; Jha, Girish Nath

Abstract

In this paper, we survey Text Summarization (TS) datasets in Indian Languages (ILs), which are also low-resource languages (LRLs). We seek to answer one primary question: is the pool of Indian Language Text Summarization (ILTS) datasets growing, or is there a resource poverty? To answer it, we pose two sub-questions about ILTS datasets: first, what characteristics (format and domain) do ILTS datasets have? Second, how do those characteristics differ from those of datasets in high-resource languages (HRLs), particularly English? We focus on datasets reported in ILTS research published during 2012-2022. The survey of ILTS and English datasets reveals two similarities and one contrast. The first similarity is that the dataset domain is commonly news (Hermann et al., 2015). The second is the dataset format, which is both extractive and abstractive. The contrast lies in how research on dataset development has progressed: compared with English, ILs face slow development and slow public release of datasets. We argue that the relatively low number of ILTS datasets has two causes: first, the absence of a dedicated forum for developing TS tools and resources; and second, the lack of shareable standard datasets in the public domain.
