Paper Title

Text Generation by Learning from Demonstrations

Paper Authors

Richard Yuanzhe Pang, He He

Paper Abstract

Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.
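
The abstract describes GOLD as an importance-weighted objective over the reference (demonstration) tokens, upweighting tokens the model is confident about and downweighting the rest. Below is a minimal PyTorch-style sketch of that idea, not the paper's exact formulation: it assumes the per-token importance weight is approximated by the model's own detached probability of the reference token with a lower-bound clip, and folds the per-token reward into a constant; the names gold_loss and weight_floor are illustrative.

import torch
import torch.nn.functional as F

def gold_loss(logits, targets, pad_id, weight_floor=0.1):
    # logits:  (batch, seq_len, vocab_size) model scores at each position
    # targets: (batch, seq_len) reference (demonstration) token ids
    log_probs = F.log_softmax(logits, dim=-1)
    # log pi_theta(y_t | y_<t, x) for each reference token
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # off-policy importance weight ~= pi_theta(y_t), detached and clipped from below,
    # so confident reference tokens are upweighted and unconfident ones downweighted
    weights = tok_logp.detach().exp().clamp(min=weight_floor)
    mask = (targets != pad_id).float()
    # importance-weighted negative log-likelihood over non-pad positions
    return -(weights * tok_logp * mask).sum() / mask.sum()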
