Paper Title

Solving math word problems with process- and outcome-based feedback

Paper Authors

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins

Paper Abstract

Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.
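
The distinction the abstract draws can be made concrete with a small sketch. The Python below is illustrative only, not the authors' implementation; all names (`Solution`, `outcome_based_labels`, `process_based_labels`) are hypothetical. It shows how the two supervision schemes label the same GSM8K-style solution: outcome-based labels come from a single final-answer check, while process-based labels require per-step annotations and can flag faulty reasoning even when the final answer is coincidentally correct.

```python
# A minimal sketch (not the paper's code) contrasting outcome-based and
# process-based supervision signals. All names here are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    steps: List[str]    # model-generated reasoning steps
    final_answer: str   # extracted final answer, e.g. "42"


def outcome_based_labels(sol: Solution, reference_answer: str) -> List[int]:
    """Outcome-based supervision: one check on the final answer, broadcast
    to every step. Cheap: it only needs the reference final answer."""
    correct = int(sol.final_answer == reference_answer)
    return [correct] * len(sol.steps)


def process_based_labels(step_annotations: List[bool]) -> List[int]:
    """Process-based supervision: one (human) judgment per reasoning step.
    More label-intensive, but it can penalize incorrect reasoning even
    when the final answer happens to be right."""
    return [int(ok) for ok in step_annotations]


# Toy example: the second step contains an arithmetic slip (6 * 8 is 48,
# not 42), yet the final answer coincidentally matches the reference.
sol = Solution(
    steps=["Each box holds 6 eggs.", "7 boxes hold 6 * 8 = 42 eggs."],
    final_answer="42",
)
print(outcome_based_labels(sol, reference_answer="42"))  # [1, 1]
print(process_based_labels([True, False]))               # [1, 0]
```

The toy example mirrors the failure mode the abstract measures: because the final answer matches the reference, outcome-based labels are all positive, while only the per-step process-based labels catch the reasoning error in a final-answer-correct solution.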
