ToTTo: A Controlled Table-To-Text Generation Dataset

Paper title

ToTTo: A Controlled Table-To-Text Generation Dataset

Authors

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das

Abstract

We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.
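
To make the controlled generation task concrete, the following is a minimal sketch of how one input (a table plus a set of highlighted cells) might be flattened into a single source string for a text generator. The record layout, the field names (`page_title`, `table`, `highlighted_cells`), the toy table contents, and the `linearize_highlighted` helper are all illustrative assumptions and are not taken from the dataset's actual schema or the authors' baselines.

```python
from typing import Dict, List

# Hypothetical, simplified record: field names and values are illustrative
# assumptions, not the dataset's real schema or real data.
example: Dict = {
    "page_title": "Example athlete",
    "table": [
        ["Year", "Competition", "Position"],
        ["2010", "World Championships", "1st"],
        ["2012", "Olympic Games", "3rd"],
    ],
    # (row, column) indices of the cells the description must cover.
    "highlighted_cells": [(2, 0), (2, 1), (2, 2)],
}


def linearize_highlighted(record: Dict) -> str:
    """Flatten only the highlighted cells into one 'header: value' string."""
    table: List[List[str]] = record["table"]
    header = table[0]  # assumes the first row holds the column headers
    parts = [f"page_title: {record['page_title']}"]
    for row, col in record["highlighted_cells"]:
        parts.append(f"{header[col]}: {table[row][col]}")
    return " | ".join(parts)


source = linearize_highlighted(example)
print(source)
```

A generator would then be asked to produce one sentence from `source`. Restricting the input to the highlighted cells is what makes the task controlled: any phrase in the output that cannot be traced back to those cells or their metadata counts as a hallucination.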
