Paper Title

Quark: Controllable Text Generation with Reinforced Unlearning

Paper Authors

Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi

Paper Abstract

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining nearby the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.
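
Below is a minimal, self-contained sketch of the training loop described in the abstract, written against plain PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the toy `ToyLM` model, the `repetition_reward` function, the quantile count, the KL weight `beta`, and the choice to condition on the highest-reward token from the very first iteration are all hypothetical simplifications introduced here.

```python
# Sketch of the Quark-style loop from the abstract: (i) sample from the current
# model, (ii) sort samples into reward quantiles and prepend a quantile token,
# (iii) train with a standard LM loss plus a KL penalty toward the frozen
# original model. All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, NUM_QUANTILES, SEQ_LEN = 100, 5, 16
# Reward tokens r_1..r_K are appended to the vocabulary; the last one marks the
# highest-reward quantile and is used for conditioning at generation time.
REWARD_TOKENS = list(range(VOCAB, VOCAB + NUM_QUANTILES))

class ToyLM(nn.Module):
    """A tiny GRU language model standing in for the pretrained LM."""
    def __init__(self, vocab):
        super().__init__()
        self.emb = nn.Embedding(vocab, 64)
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.head = nn.Linear(64, vocab)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)  # (batch, seq, vocab) logits

def repetition_reward(sample):
    """Toy reward: fraction of distinct tokens (higher = less repetition)."""
    return len(set(sample.tolist())) / len(sample)

@torch.no_grad()
def sample_sequences(model, n, prefix_token):
    """(i) Collect samples from the current model, conditioned on a reward token."""
    ids = torch.full((n, 1), prefix_token, dtype=torch.long)
    for _ in range(SEQ_LEN):
        logits = model(ids)[:, -1, :VOCAB]            # never sample reward tokens
        nxt = torch.multinomial(F.softmax(logits, -1), 1)
        ids = torch.cat([ids, nxt], dim=1)
    return ids[:, 1:]                                  # drop the conditioning token

model = ToyLM(VOCAB + NUM_QUANTILES)      # policy being fine-tuned
ref   = ToyLM(VOCAB + NUM_QUANTILES)      # frozen copy of the original model
ref.load_state_dict(model.state_dict())
ref.requires_grad_(False)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
beta = 0.05                               # KL penalty weight (assumed value)

for step in range(3):
    # (i) sample with the current model, conditioning on the best reward token
    samples = sample_sequences(model, n=32, prefix_token=REWARD_TOKENS[-1])
    rewards = torch.tensor([repetition_reward(s) for s in samples])

    # (ii) sort samples into quantiles by reward; prepend that quantile's token
    order = rewards.argsort()
    per_q = len(order) // NUM_QUANTILES
    batch = []
    for q in range(NUM_QUANTILES):
        for idx in order[q * per_q:(q + 1) * per_q]:
            tok = torch.tensor([REWARD_TOKENS[q]])
            batch.append(torch.cat([tok, samples[idx]]))
    batch = torch.stack(batch)

    # (iii) standard LM loss on each quantile given its reward token, plus a
    # KL penalty keeping the policy near the frozen original model (the exact
    # direction and per-token weighting here are a simplification).
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets.reshape(-1))
    with torch.no_grad():
        ref_logp = F.log_softmax(ref(inputs), dim=-1)
    kl = F.kl_div(F.log_softmax(logits, dim=-1), ref_logp,
                  log_target=True, reduction="batchmean")
    loss = lm_loss + beta * kl

    opt.zero_grad()
    loss.backward()
    opt.step()

# At generation time, condition on the highest-reward token so that sampled
# text exhibits less of the unwanted property.
generated = sample_sequences(model, n=4, prefix_token=REWARD_TOKENS[-1])
```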
