The combination of Monte-Carlo Tree Search (MCTS) and deep reinforcement
learning is state-of-the-art in two-player perfect-information games. In this
paper, we describe a search algorithm built on a variant of MCTS that we
enhance with 1) a novel action value normalization mechanism for games with
potentially unbounded rewards (as is the case in many optimization
problems), 2) a virtual loss function that enables effective search
parallelization, and 3) a policy network, trained by generations of self-play,
that guides the search. We evaluate our method on "SameGame", a
popular single-player test domain. Our experimental results indicate that our
method outperforms baseline algorithms on several board sizes. Additionally, it
is competitive with state-of-the-art search algorithms on a public set of
positions.
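
To illustrate why action values need normalizing when rewards are unbounded, the following is a minimal sketch of one generic option: min-max scaling of the action values observed in the tree before they enter a PUCT-style selection score. This is not the normalization mechanism proposed in the paper, whose details the abstract does not give. The names `MinMaxStats`, `puct_score`, and `c_puct`, and the neutral fallback of 0.5, are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MinMaxStats:
    """Tracks the smallest and largest action values seen so far (hypothetical helper)."""
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, value: float) -> None:
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def normalize(self, value: float) -> float:
        # Scale into [0, 1] once at least two distinct values have been observed.
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        # Assumption: with no usable range yet, fall back to a neutral value.
        return 0.5


def puct_score(q: float, prior: float, parent_visits: int, child_visits: int,
               stats: MinMaxStats, c_puct: float = 1.5) -> float:
    """PUCT-style score using a normalized action value instead of the raw one."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return stats.normalize(q) + exploration
```

Without scaling, a raw score in the hundreds or thousands would swamp the exploration term, which is on the order of `c_puct`; after scaling, both terms live on a comparable range regardless of the game's score magnitude.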

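This paper presents a search algorithm that combines Monte-Carlo Tree Search with deep reinforcement learning. The search is improved by 1) a novel action value normalization mechanism for problems with potentially unbounded rewards, 2) a virtual loss function that enables effective search parallelization, and 3) a policy network, trained over generations of self-play, that guides the search. Experiments on SameGame show that the algorithm outperforms baseline algorithms on several board sizes and is competitive with state-of-the-art search algorithms on a public set of positions.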