Paper Title
Efficient Use of Heuristics for Accelerating XCS-Based Policy Learning in Markov Games
Paper Authors
Paper Abstract
In Markov games, playing against non-stationary opponents that learn is still challenging for reinforcement learning (RL) agents, because such opponents evolve their policies concurrently. This increases the complexity of the learning task and slows the RL agents' learning. This paper proposes the efficient use of rough heuristics to speed up policy learning when playing against concurrent learners. Specifically, we propose an algorithm that efficiently learns explainable and generalized action-selection rules by taking advantage of the representation of quantitative heuristics and an opponent model within an eXtended Classifier System (XCS) in zero-sum Markov games. A neural network models the opponent from its observed behavior, and the inferred policy is used for action selection and rule evolution. When multiple heuristic policies are available, we introduce the concept of Pareto optimality for action selection. In addition, by exploiting the condition representation and matching mechanism of XCS, the heuristic policies and the opponent model can provide guidance in situations with similar feature representations. Furthermore, we introduce an accuracy-based eligibility trace mechanism to speed up rule evolution, i.e., classifiers that match the historical traces are reinforced according to their accuracy. We demonstrate the advantages of the proposed algorithm over several benchmark algorithms in soccer and thief-and-hunter scenarios.
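The abstract mentions that a neural network models the opponent and that the inferred policy feeds action selection. The following is a minimal sketch of one plausible reading: a small classifier trained on observed (state, opponent action) pairs whose softmax output is read as the inferred opponent policy. The architecture, layer sizes, and all names (OpponentModel, train_step) are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class OpponentModel(nn.Module):
    """Predicts the opponent's action distribution from the state."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def policy(self, state: torch.Tensor) -> torch.Tensor:
        # Inferred probability that the opponent takes each action.
        return torch.softmax(self.net(state), dim=-1)

model = OpponentModel(state_dim=8, n_actions=4)  # sizes are illustrative
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(states: torch.Tensor, opp_actions: torch.Tensor) -> float:
    """states: [B, 8] floats; opp_actions: [B] observed action labels."""
    opt.zero_grad()
    loss = loss_fn(model.net(states), opp_actions)
    loss.backward()
    opt.step()
    return loss.item()
```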
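For the Pareto-optimality idea, a common formulation is: when several heuristic policies each score the candidate actions, keep only the actions no other action dominates (at least as good under every heuristic, strictly better under one). The sketch below implements that standard dominance test; whether the paper uses exactly this criterion is an assumption.

```python
from typing import List

def pareto_optimal_actions(scores: List[List[float]]) -> List[int]:
    """scores[a][h] = preference of heuristic h for action a.
    Returns the indices of actions not dominated by any other action."""
    optimal = []
    for a in range(len(scores)):
        dominated = False
        for b in range(len(scores)):
            if b == a:
                continue
            # b dominates a: >= under every heuristic, > under at least one.
            pairs = list(zip(scores[b], scores[a]))
            if all(sb >= sa for sb, sa in pairs) and \
               any(sb > sa for sb, sa in pairs):
                dominated = True
                break
        if not dominated:
            optimal.append(a)
    return optimal

# Example: 3 actions scored by 2 heuristics.
print(pareto_optimal_actions([[0.9, 0.1], [0.5, 0.5], [0.4, 0.4]]))
# -> [0, 1]: action 2 is dominated by action 1.
```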
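Finally, a hedged sketch of the accuracy-based eligibility trace: classifiers whose conditions match states along the recent trace receive extra reinforcement, weighted by their accuracy estimate and decayed with recency. The classifier fields, the decay scheme, and the update rule are assumptions made to illustrate the idea; the paper's actual XCS update may differ.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str        # ternary condition string, e.g. "1#0#"
    action: int
    prediction: float = 0.0
    kappa: float = 1.0    # accuracy estimate in [0, 1]

def matches(cond: str, state: str) -> bool:
    """Ternary match: '#' is a wildcard, other symbols must agree."""
    return all(c == '#' or c == s for c, s in zip(cond, state))

def reinforce_trace(population, trace, reward, beta=0.2, decay=0.9):
    """trace: list of (state, action) pairs, most recent last.
    Classifiers matching older steps get exponentially less credit."""
    eligibility = 1.0
    for state, action in reversed(trace):
        for cl in population:
            if cl.action == action and matches(cl.condition, state):
                # Reinforcement scaled by accuracy and trace eligibility.
                cl.prediction += beta * eligibility * cl.kappa * \
                                 (reward - cl.prediction)
        eligibility *= decay
```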