Paper Title
Human-centric Dialog Training via Offline Reinforcement Learning
Paper Authors
Paper Abstract
How can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? We start by hosting models online and gathering human feedback from real-time, open-ended conversations, which we then use to train and improve the models via offline reinforcement learning (RL). We identify implicit conversational cues, including language similarity, elicitation of laughter, sentiment, and more, that indicate positive human feedback, and embed these in multiple reward functions. A well-known challenge is that learning an RL policy in an offline setting usually fails, due both to the inability to explore and to the tendency to over-estimate future reward. These problems become even harder when applying RL to language models, which can easily have a 20,000-action vocabulary and many possible reward functions. We address this challenge by developing a novel class of offline RL algorithms. These algorithms use KL-control to penalize divergence from a pre-trained prior language model, and use a new strategy to make the algorithm pessimistic, rather than optimistic, in the face of uncertainty. We test the resulting dialog model with ratings from 80 users in an open-domain setting and find that it achieves significant improvements over existing deep offline RL approaches. The novel offline RL method is viable for improving any existing generative dialog model using a static dataset of human feedback.
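To make the two ideas in the abstract concrete, the sketch below shows one way a KL-controlled, pessimistic Q-learning update could look in PyTorch. This is a minimal illustration, not the authors' implementation: the names `q_net`, `target_nets`, `prior_lm`, and `kl_weight` are all hypothetical, and the min over a small ensemble of target networks stands in for whatever uncertainty-based pessimism strategy the paper actually uses.

```python
# Minimal sketch (assumed names, not the paper's code) of offline Q-learning
# with (1) KL-control against a frozen pre-trained prior language model and
# (2) a pessimistic value backup via a min over an ensemble of target nets.
import torch
import torch.nn.functional as F

def kl_control_loss(q_net, target_nets, prior_lm, batch,
                    kl_weight=0.1, gamma=0.99):
    # batch holds logged dialog transitions: token histories (states),
    # next tokens taken (actions), per-utterance rewards, and done flags.
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken in the logged conversations.
    q_all = q_net(states)                                 # [B, vocab_size]
    q_sa = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Boltzmann policy derived from the Q-values.
        log_pi = F.log_softmax(q_net(next_states), dim=-1)
        # Frozen prior language model gives log p(a | s).
        log_prior = F.log_softmax(prior_lm(next_states), dim=-1)

        # Pessimism: elementwise min over the target ensemble, so actions
        # the networks disagree on are valued by their worst-case estimate.
        target_qs = torch.stack([t(next_states) for t in target_nets])
        q_min = target_qs.min(dim=0).values               # [B, vocab_size]

        # KL-control: the backed-up value is penalized by the policy's
        # divergence from the pre-trained prior, keeping generations fluent.
        pi = log_pi.exp()
        v_next = (pi * (q_min - kl_weight * (log_pi - log_prior))).sum(-1)
        target = rewards + gamma * (1.0 - dones) * v_next

    return F.mse_loss(q_sa, target)
```

The KL term discourages the learned policy from drifting away from the prior language model's distribution over tokens, while the ensemble min keeps the target values conservative where the offline data gives little coverage.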