Paper Title


Equivariant Networks for Zero-Shot Coordination

Authors

Darius Muglich, Christian Schroeder de Witt, Elise van der Pol, Shimon Whiteson, Jakob Foerster

Abstract


Successful coordination in Dec-POMDPs requires agents to adopt robust strategies and interpretable styles of play for their partner. A common failure mode is symmetry breaking, when agents arbitrarily converge on one out of many equivalent but mutually incompatible policies. These settings commonly involve partial observability, e.g. waving your right hand vs. left hand to convey a covert message. In this paper, we present a novel equivariant network architecture for use in Dec-POMDPs that leverages environmental symmetry to improve zero-shot coordination, doing so more effectively than prior methods. Our method also acts as a "coordination-improvement operator" for generic, pre-trained policies, and thus may be applied at test time in conjunction with any self-play algorithm. We provide theoretical guarantees for our work and test on the AI benchmark task of Hanabi, where we demonstrate that our method outperforms other symmetry-aware baselines in zero-shot coordination and improves the coordination ability of a variety of pre-trained policies. In particular, we show our method can be used to improve on the state of the art for zero-shot coordination on the Hanabi benchmark.
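To make the equivariance idea concrete: a standard way to obtain a map that commutes with a symmetry group is group averaging, i.e. symmetrizing an arbitrary base network over the group. The sketch below is illustrative only and assumes a toy cyclic permutation group over five "colors" (in the spirit of Hanabi's color-relabelling symmetry); it is not the paper's actual architecture, and `f`, `f_eq`, and the group construction are hypothetical names for this example.

```python
import numpy as np

# Group averaging (symmetrization): for a symmetry group G acting by
# permutation matrices P, an equivariant version of any function f is
#   f_eq(x) = (1/|G|) * sum over P in G of  P^{-1} f(P x),
# which satisfies f_eq(g x) = g f_eq(x) for every g in G.

rng = np.random.default_rng(0)
n = 5  # e.g. five Hanabi colors

# A small permutation group: the n cyclic shifts of the "colors".
group = [np.roll(np.eye(n), k, axis=0) for k in range(n)]

# An arbitrary, non-equivariant base network: one linear layer + ReLU.
W = rng.normal(size=(n, n))
def f(x):
    return np.maximum(W @ x, 0.0)

def f_eq(x):
    # Average conjugated copies of f over the group (P.T inverts a
    # permutation matrix). ReLU commutes with permutations, so the
    # symmetrized map is exactly equivariant.
    return np.mean([P.T @ f(P @ x) for P in group], axis=0)

x = rng.normal(size=n)
g = group[2]
# Equivariance check: permuting the input permutes the output identically.
assert np.allclose(f_eq(g @ x), g @ f_eq(x))
```

The check passes because the group is closed under composition: substituting `Q = P @ g` reindexes the average, pulling `g` outside the sum. The same symmetrization can wrap any pre-trained policy at test time, which is the intuition behind treating equivariance as a "coordination-improvement operator".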
