An Analysis of Discretization Methods for Communication Learning with Multi-Agent Reinforcement Learning

Paper Title

An Analysis of Discretization Methods for Communication Learning with Multi-Agent Reinforcement Learning

Authors

Vanneste, Astrid; Vanneste, Simon; Mets, Kevin; De Schepper, Tom; Mercelis, Siegfried; Latré, Steven; Hellinckx, Peter

Abstract

Communication is crucial in multi-agent reinforcement learning when agents cannot observe the full state of the environment. The most common approach to learned communication between agents is a differentiable communication channel, which allows gradients to flow between agents as a form of feedback. However, this is challenging when we want to use discrete messages to reduce the message size, since gradients cannot flow through a discrete communication channel. Previous work proposed methods to deal with this problem. However, these methods were tested in different communication learning architectures and environments, making them hard to compare. In this paper, we compare several state-of-the-art discretization methods as well as two methods that have not been used for communication learning before. We do this comparison in the context of communication learning using gradients from other agents and perform tests on several environments. Our results show that none of the methods is best in all environments. The best choice of discretization method depends greatly on the environment. However, the discretize regularize unit (DRU), the straight-through DRU, and the straight-through Gumbel softmax show the most consistent results across all the tested environments. Therefore, these methods prove to be the best choice for general use, while the straight-through estimator and the Gumbel softmax may provide better results in specific environments but fail completely in others.
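To make two of the methods named in the abstract concrete, the following is a minimal sketch of the forward passes of a DRU and of a Gumbel softmax sample, written in plain Python. It is not the authors' implementation: the function names, the noise level `sigma=2.0`, and the temperature `tau=1.0` are assumptions chosen for illustration. The straight-through variants additionally send the discretized value forward while using the gradient of the relaxed value in the backward pass, which needs an automatic differentiation framework and is not shown here.

```python
import math
import random


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dru(logit, sigma=2.0, training=True, rng=random):
    """Discretize/regularize unit for one message bit (illustrative sketch)."""
    if training:
        # Training: add Gaussian noise, then squash to (0, 1) so the
        # channel stays differentiable.
        return _sigmoid(logit + rng.gauss(0.0, sigma))
    # Execution: hard-threshold to a binary message.
    return 1.0 if logit > 0 else 0.0


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample over message symbols (illustrative sketch)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = max(rng.random(), 1e-12)  # guard against log(0)
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    # Softmax with the maximum subtracted for numerical stability.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch, `dru(x, training=False)` returns an exact 0 or 1, while `gumbel_softmax` returns a probability vector that moves closer to one-hot as `tau` decreases.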
