Paper Title


Unsupervised Learning of Discourse Structures using a Tree Autoencoder

Paper Authors

Patrick Huber and Giuseppe Carenini

Abstract


Discourse information, as postulated by popular discourse theories such as RST and PDTB, has been shown to improve a growing number of downstream NLP tasks, demonstrating positive effects and synergies of discourse with important real-world applications. While methods for incorporating discourse become more and more sophisticated, the growing need for robust and general discourse structures has not been sufficiently met by current discourse parsers, which are usually trained on small-scale datasets in a strictly limited number of domains. This makes their predictions on arbitrary tasks noisy and unreliable. The resulting lack of high-quality, high-quantity discourse trees poses a severe limitation to further progress. To alleviate this shortcoming, we propose a new strategy to generate tree structures in a task-agnostic, unsupervised fashion by extending a latent tree induction framework with an auto-encoding objective. The proposed approach can be applied to any tree-structured objective, such as syntactic parsing, discourse parsing, and others. However, due to the especially difficult annotation process required to generate discourse trees, we initially develop a method to generate larger and more diverse discourse treebanks. In this paper we infer general tree structures of natural text in multiple domains, showing promising results on a diverse set of tasks.
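To make the core idea concrete: an auto-encoding objective over a latent tree means the model induces a binary tree over the input units, composes them bottom-up into a single root embedding, then decomposes that root top-down and is penalized for failing to reconstruct the original leaves. The following is a minimal toy sketch of that loop, not the authors' model: the greedy norm-based merge scorer, the random linear compose/decompose layers, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension for each discourse unit (toy value)

# Random stand-ins for learned parameters (in practice trained end-to-end).
W_comp = rng.standard_normal((D, 2 * D)) * 0.1   # merges two children -> parent
W_deco = rng.standard_normal((2 * D, D)) * 0.1   # splits parent -> two children

def compose(left, right):
    """Encoder step: merge two sibling embeddings into one parent embedding."""
    return np.tanh(W_comp @ np.concatenate([left, right]))

def decompose(parent):
    """Decoder step: reconstruct the two child embeddings from a parent."""
    both = W_deco @ parent
    return both[:D], both[D:]

def encode_greedy(leaves):
    """Greedy latent tree induction: repeatedly merge the adjacent pair whose
    composed parent scores highest (here, a simple norm-based stand-in score)."""
    nodes = [(emb, i) for i, emb in enumerate(leaves)]  # (embedding, subtree)
    while len(nodes) > 1:
        scores = [np.linalg.norm(compose(nodes[i][0], nodes[i + 1][0]))
                  for i in range(len(nodes) - 1)]
        i = int(np.argmax(scores))
        parent = compose(nodes[i][0], nodes[i + 1][0])
        nodes[i:i + 2] = [(parent, (nodes[i][1], nodes[i + 1][1]))]
    return nodes[0]  # (root embedding, induced binary tree over leaf indices)

def decode(emb, tree):
    """Walk the induced tree top-down, emitting reconstructed leaf embeddings."""
    if isinstance(tree, int):
        return [emb]
    left_tree, right_tree = tree
    left_emb, right_emb = decompose(emb)
    return decode(left_emb, left_tree) + decode(right_emb, right_tree)

# Four random vectors standing in for encoded EDUs (elementary discourse units).
leaves = [rng.standard_normal(D) for _ in range(4)]
root_emb, tree = encode_greedy(leaves)
recon = decode(root_emb, tree)

# Auto-encoding objective: mean squared reconstruction error over the leaves.
loss = float(np.mean([(a - b) ** 2 for a, b in zip(leaves, recon)]))
print("induced tree:", tree)
print("reconstruction loss:", round(loss, 4))
```

Minimizing this reconstruction loss with respect to the compose/decompose parameters gives the tree induction an unsupervised training signal, which is what lets the approach scale beyond small annotated treebanks.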
