Paper Title
TED: A Pretrained Unsupervised Summarization Model with Theme Modeling and Denoising
Paper Authors
Paper Abstract
Text summarization aims to extract the essential information from a piece of text and condense it into a shorter version. Existing unsupervised abstractive summarization models rely on recurrent neural network architectures, whereas the more recently proposed Transformer offers substantially greater modeling capacity. Moreover, most previous summarization models ignore the abundant unlabeled corpora available for pretraining. To address these issues, we propose TED, a Transformer-based unsupervised abstractive summarization system pretrained on large-scale data. We first exploit the lead bias in news articles to pretrain the model on millions of unlabeled articles. We then finetune TED on target domains with theme modeling and a denoising autoencoder to improve the quality of the generated summaries. Notably, TED outperforms all unsupervised abstractive baselines on the NYT, CNN/DM, and English Gigaword datasets, which cover a variety of document styles. Further analysis shows that the summaries generated by TED are highly abstractive and that each component of TED's objective function is highly effective.
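To make the three ideas named in the abstract concrete, below is a minimal sketch, not the authors' implementation: lead-bias pretraining (the lead sentences of a news article serve as a pseudo-summary target, so no labels are needed), the corruption step of a denoising autoencoder, and a cosine-similarity stand-in for the theme-modeling signal that keeps a summary semantically close to its document. The sentence splitter, the lead size `k`, and the noise rates are illustrative assumptions.

```python
# Illustrative sketch of lead-bias pretraining pairs, denoising-autoencoder
# corruption, and a theme-similarity score. All specifics (k, drop_p,
# shuffle_window, the regex splitter) are assumptions, not the paper's values.
import math
import random
import re


def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter; a real pipeline would use a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead_bias_example(article: str, k: int = 3) -> tuple[str, str]:
    """Build one (source, pseudo-summary) pretraining pair.

    The lead-k sentences of a news article act as the target summary and
    the remaining sentences as the input document.
    """
    sents = split_sentences(article)
    return " ".join(sents[k:]), " ".join(sents[:k])


def add_noise(tokens: list[str], drop_p: float = 0.1,
              shuffle_window: int = 3) -> list[str]:
    """Corrupt a token sequence for denoising-autoencoder training.

    Tokens are randomly dropped and locally shuffled; the model is then
    trained to reconstruct the clean sequence from this corrupted input.
    """
    noisy = [t for t in tokens if random.random() > drop_p]
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + random.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy


def theme_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between a summary embedding and a document
    embedding; a theme-modeling loss can reward keeping this high."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


if __name__ == "__main__":
    article = ("The city council approved the new budget on Monday. "
               "The plan raises transit funding by ten percent. "
               "Officials said the vote followed months of debate. "
               "Critics argued the increase is still too small. "
               "A final review is scheduled for next spring.")
    src, pseudo_summary = lead_bias_example(article, k=2)
    print("pseudo-summary:", pseudo_summary)
    print("corrupted     :", " ".join(add_noise(pseudo_summary.split())))
```

As a usage note, the lead-bias pairs supply the unsupervised pretraining signal, while the corruption and similarity functions correspond to the two finetuning components the abstract credits for summary quality.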