Paper Title
Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic
Paper Authors
Paper Abstract
Reinforcement learning (RL) has achieved remarkable performance in numerous sequential decision-making and control tasks. However, a common problem is that the learned near-optimal policy tends to overfit to the training environment and may not generalize to situations never encountered during training. In practical applications, the randomness of the environment can lead to devastating events, which should be the focus of safety-critical systems such as autonomous driving. In this paper, we introduce a minimax formulation and a distributional framework to improve the generalization ability of RL algorithms and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks an optimal policy under the most severe variations of the environment, in which the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which the risk of different returns can be modeled explicitly, thereby yielding a risk-averse protagonist policy and a risk-seeking adversarial policy. We implement our method on decision-making tasks for autonomous vehicles at intersections and test the trained policy in distinct environments. Results demonstrate that our method can greatly improve the generalization ability of the protagonist agent to different environmental variations.
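As a rough sketch of the minimax formulation described in the abstract (the notation below is our own shorthand, not taken from the paper): let $\pi$ denote the protagonist policy, $\mu$ the adversary policy, and $Z^{\pi,\mu}(s, a, a^{\mathrm{adv}})$ the state-action return distribution, whose expectation is the action-value function. The overall objective can then be written as

\[
\pi^{*} = \arg\max_{\pi} \min_{\mu} \; \mathbb{E}_{s}\!\left[ \, \mathbb{E}\big[ Z^{\pi,\mu}(s, a, a^{\mathrm{adv}}) \big] \, \right],
\qquad a \sim \pi(\cdot \mid s), \; a^{\mathrm{adv}} \sim \mu(\cdot \mid s).
\]

One common way to encode the risk preferences mentioned above (assumed here purely for illustration; the paper may use a different risk measure) is a variance-style penalty or bonus on the learned return distribution:

\[
\text{protagonist: } \; \mathbb{E}[Z] - \lambda \, \mathrm{Std}[Z],
\qquad
\text{adversary: } \; \mathbb{E}[Z] + \lambda \, \mathrm{Std}[Z],
\]

where $\lambda \ge 0$ controls how risk-averse the protagonist and how risk-seeking the adversary are.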