Paper Title

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

Paper Authors

Jian Guan, Minlie Huang

Abstract

Despite the success of existing referenced metrics (e.g., BLEU and MoverScore), they correlate poorly with human judgments for open-ended text generation including story or dialog generation because of the notorious one-to-many issue: there are many plausible outputs for the same input, which may differ substantially in literal or semantics from the limited number of given references. To alleviate this issue, we propose UNION, a learnable unreferenced metric for evaluating open-ended story generation, which measures the quality of a generated story without any reference. Built on top of BERT, UNION is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories. We propose an approach of constructing negative samples by mimicking the errors commonly observed in existing NLG models, including repeated plots, conflicting logic, and long-range incoherence. Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories, which correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.
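The abstract describes constructing negative training samples by perturbing human-written stories to mimic common NLG errors (repeated plots, conflicting logic, long-range incoherence). The sketch below illustrates that idea with two simple, assumed heuristics; the function name and perturbation rules are hypothetical, not the paper's actual procedure.

```python
import random

def make_negative_samples(story, seed=0):
    """Create perturbed (negative) stories from a human-written one.

    `story` is a list of sentences. Each perturbation mimics one error
    type named in the abstract; the heuristics here are illustrative
    assumptions, not UNION's actual construction method.
    """
    rng = random.Random(seed)
    negatives = {}

    # Repeated plot: duplicate a randomly chosen sentence in place.
    i = rng.randrange(len(story))
    negatives["repetition"] = story[:i + 1] + [story[i]] + story[i + 1:]

    # Long-range incoherence: reorder the sentences until the order differs.
    shuffled = story[:]
    while shuffled == story:
        rng.shuffle(shuffled)
    negatives["incoherence"] = shuffled

    return negatives

# Example usage with a toy three-sentence story:
story = ["Tom found a map.", "He followed it for days.", "He dug up a chest."]
negatives = make_negative_samples(story)
```

A metric model (e.g., a BERT classifier, as in the paper) would then be trained to label `story` as positive and each entry of `negatives` as negative, and to recover the applied perturbation.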
