Title

Reward Reports for Reinforcement Learning

Authors

Thomas Krendl Gilbert, Nathan Lambert, Sarah Dean, Tom Zick, Aaron Snoswell

Abstract

Building systems that are good for society in the face of complex societal effects requires a dynamic approach. Recent approaches to machine learning (ML) documentation have demonstrated the promise of discursive frameworks for deliberation about these complexities. However, these developments have been grounded in a static ML paradigm, leaving the role of feedback and post-deployment performance unexamined. Meanwhile, recent work in reinforcement learning has shown that the effects of feedback and optimization objectives on system behavior can be wide-ranging and unpredictable. In this paper we sketch a framework for documenting deployed and iteratively updated learning systems, which we call Reward Reports. Taking inspiration from various contributions to the technical literature on reinforcement learning, we outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data. After presenting the elements of a Reward Report, we discuss a concrete example: Meta's BlenderBot 3 chatbot. Several others for game-playing (DeepMind's MuZero), content recommendation (MovieLens), and traffic control (Project Flow) are included in the appendix.
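To make the idea of a "living document" concrete, below is a minimal sketch of how a Reward Report might be represented programmatically as an append-only record. This is an illustration only: the field names (`optimization_target`, `design_assumptions`, `change_log`, and so on) are hypothetical and do not reproduce the template defined in the paper; they simply mirror the abstract's emphasis on tracking updates to design choices and assumptions behind what a deployed system is optimizing for.

```python
# Illustrative sketch only: the schema below is hypothetical and is not
# the paper's actual Reward Report template.
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ReportUpdate:
    """One entry in the report's change log: what changed and why."""
    when: date
    change: str        # e.g. "reward reweighted to penalize unsafe replies"
    rationale: str     # why the designers made the change


@dataclass
class RewardReport:
    """A living document tracking what a deployed system optimizes for."""
    system_name: str
    optimization_target: str                 # the stated objective / reward
    design_assumptions: List[str]            # assumptions behind that objective
    observed_feedback_effects: List[str] = field(default_factory=list)
    change_log: List[ReportUpdate] = field(default_factory=list)

    def record_update(self, change: str, rationale: str) -> None:
        """Append a dated entry rather than overwriting history."""
        self.change_log.append(ReportUpdate(date.today(), change, rationale))


# Hypothetical usage: a report for a chatbot, revised after deployment.
report = RewardReport(
    system_name="BlenderBot 3",
    optimization_target="maximize user engagement signals",
    design_assumptions=["user feedback reflects conversation quality"],
)
report.record_update(
    change="added a safety term to the optimization objective",
    rationale="post-deployment feedback surfaced unsafe replies",
)
```

The append-only change log is the point of the sketch: unlike a static model or data card, the record is meant to accumulate dated revisions as feedback from deployment reshapes the system's objective.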
