Paper Title

Analysis of Hyper-Parameters for Small Games: Iterations or Epochs in Self-Play?

Paper Authors

Hui Wang, Michael Emmerich, Mike Preuss, Aske Plaat

Paper Abstract


The landmark achievements of AlphaGo Zero have created great research interest into self-play in reinforcement learning. In self-play, Monte Carlo Tree Search is used to train a deep neural network, which is then used in tree searches. Training itself is governed by many hyper-parameters. There has been surprisingly little research on design choices for hyper-parameter values and loss functions, presumably because of the prohibitive computational cost of exploring the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We use small games to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis we identify 4 important hyper-parameters to assess further. To start, we find the surprising result that too much training can sometimes lead to lower performance. Our main result is that the number of self-play iterations subsumes MCTS-search simulations, game episodes, and training epochs. The intuition is that these three increase together as self-play iterations increase, and that increasing them individually is sub-optimal. A consequence of our experiments is a direct recommendation for setting hyper-parameter values in self-play: the overarching outer loop of self-play iterations should be maximized, in preference to the three inner-loop hyper-parameters, which should be set to lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations.
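The loop structure the abstract refers to can be pictured as one outer loop (self-play iterations) wrapping three inner loops, each controlled by its own hyper-parameter. The following is a minimal Python sketch of that structure only; the function names, the stubbed-out MCTS, and the fixed toy game length are illustrative assumptions, not the paper's actual implementation:

```python
import random

GAME_LENGTH = 5  # assumed fixed episode length, for illustration only


def mcts_move(state, num_simulations):
    """Stand-in for an MCTS search: run the simulation budget, pick a move.

    A real implementation would do selection/expansion/backup per
    simulation and return the visit-count-based move.
    """
    for _ in range(num_simulations):
        pass  # one MCTS simulation would run here
    return random.choice([0, 1, 2])


def self_play_training(num_iterations, num_episodes,
                       num_simulations, num_epochs):
    """Outer loop of self-play iterations; the three inner-loop
    hyper-parameters (episodes, simulations, epochs) apply inside
    each iteration. Returns simple counters instead of a trained net."""
    total_examples = 0
    total_epoch_passes = 0
    for _ in range(num_iterations):            # outer loop: maximize this
        buffer = []
        for _ in range(num_episodes):          # inner loop 1: game episodes
            for _ in range(GAME_LENGTH):
                move = mcts_move(None, num_simulations)  # inner loop 2: MCTS sims
                buffer.append(move)            # collect a training example
        for _ in range(num_epochs):            # inner loop 3: training epochs
            total_epoch_passes += 1            # one pass over the buffer
        total_examples += len(buffer)
    return total_examples, total_epoch_passes
```

The paper's recommendation maps onto this sketch as: raise `num_iterations` while keeping `num_episodes`, `num_simulations`, and `num_epochs` low, rather than inflating any single inner loop.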
