论文标题
在基于DNN的最佳搜索索科班计划中,左尾巴和政策和价值网络的有效性
Left Heavy Tails and the Effectiveness of the Policy and Value Networks in DNN-based best-first search for Sokoban Planning
论文作者
论文摘要
尽管实用的求解器在各种NP完整域中取得了成功,例如SAT和CSP以及使用深度强化学习来解决诸如GO之类的两人游戏,但某些类别的Pspace-Hard计划问题仍然无法触及。由于硬实例上的指数搜索空间,即使是精心设计的域专用求解器也可能会迅速失败。结合了传统搜索方法的最新作品,例如最佳优先搜索和蒙特卡洛树搜索,以及深度神经网络(DNN)的启发式方法,已经显示出有希望的进步,并且可以解决超出专业求解器以外的大量艰苦计划实例。为了更好地理解这些方法为何起作用,我们研究了基于DNN的最佳优先搜索的政策和价值网络的相互作用,并展示了策略网络的惊人有效性,并通过价值网络进一步增强了价值网络,作为搜索的指导启发式。为了进一步了解现象,我们研究了搜索算法的成本分布,发现索科巴实例可以具有重尾的运行时分布,左侧和右侧都有尾巴。特别是,我们首次展示了\ textit {左尾巴}的存在,并提出了一个抽象的树模型,可以从经验上解释这些尾巴的外观。实验表明,政策网络是一种强大的启发式指导搜索的关键作用,这可以通过避免探索成倍的尺寸的子树来导致左尾部具有多项式缩放。我们的结果还证明了与传统组合求解器中广泛使用的随机重新开始的重要性,用于避免左和右重尾巴的基于DNN的搜索方法。
Despite the success of practical solvers in various NP-complete domains such as SAT and CSP as well as using deep reinforcement learning to tackle two-player games such as Go, certain classes of PSPACE-hard planning problems have remained out of reach. Even carefully designed domain-specialized solvers can fail quickly due to the exponential search space on hard instances. Recent works that combine traditional search methods, such as best-first search and Monte Carlo tree search, with Deep Neural Networks' (DNN) heuristics have shown promising progress and can solve a significant number of hard planning instances beyond specialized solvers. To better understand why these approaches work, we studied the interplay of the policy and value networks of DNN-based best-first search on Sokoban and show the surprising effectiveness of the policy network, further enhanced by the value network, as a guiding heuristic for the search. To further understand the phenomena, we studied the cost distribution of the search algorithms and found that Sokoban instances can have heavy-tailed runtime distributions, with tails both on the left and right-hand sides. In particular, for the first time, we show the existence of \textit{left heavy tails} and propose an abstract tree model that can empirically explain the appearance of these tails. The experiments show the critical role of the policy network as a powerful heuristic guiding the search, which can lead to left heavy tails with polynomial scaling by avoiding exploring exponentially sized subtrees. Our results also demonstrate the importance of random restarts, as are widely used in traditional combinatorial solvers, for DNN-based search methods to avoid left and right heavy tails.