Adversarial self-play in two-player games has delivered impressive results
when used with reinforcement learning algorithms that combine deep neural
networks and tree search. Algorithms like AlphaZero and Expert Iteration learn
tabula rasa, producing highly informative training data on the fly. However,
the self-play training strategy is not directly applicable to single-player
games. Recently, several practically important combinatorial optimisation
problems, such as the travelling salesman problem and the bin packing problem,
have been reformulated as reinforcement learning problems, increasing the
importance of enabling the benefits of self-play beyond two-player games. We
present the Ranked Reward (R2) algorithm, which accomplishes this by ranking the
rewards obtained by a single agent over multiple games to create a relative
performance metric. Results from applying the R2 algorithm to instances of
two-dimensional and three-dimensional bin packing problems show that it
outperforms generic Monte Carlo tree search, heuristic algorithms and integer
programming solvers. We also present an analysis of the ranked reward
mechanism, in particular, the effects of problem instances with varying
difficulty and different ranking thresholds.
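
The ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RankedRewardBuffer`, the buffer capacity, the percentile threshold, the binary +1/-1 output, the random tie-breaking and the handling of an empty buffer are all assumptions made for illustration. The sketch stores the rewards of recent games and labels a new reward as a win or a loss depending on whether it beats a percentile of that history.

```python
import math
import random
from collections import deque


class RankedRewardBuffer:
    """Hypothetical helper: turns raw single-player rewards into relative +1/-1 signals."""

    def __init__(self, capacity=250, percentile=75.0, rng=None):
        # Assumed hyperparameters; the abstract does not specify them.
        self.rewards = deque(maxlen=capacity)  # rewards from the most recent games
        self.percentile = percentile           # ranking threshold, in percent
        self.rng = rng or random.Random()

    def threshold(self):
        """Nearest-rank percentile of the stored rewards."""
        ordered = sorted(self.rewards)
        rank = max(1, math.ceil(self.percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]

    def rank(self, reward):
        """Compare a new reward with recent history, then store it."""
        if not self.rewards:
            ranked = 1.0  # assumption: nothing to compare against yet
        else:
            cutoff = self.threshold()
            if reward > cutoff:
                ranked = 1.0   # better than the agent's recent performance
            elif reward < cutoff:
                ranked = -1.0  # worse than the agent's recent performance
            else:
                ranked = self.rng.choice([1.0, -1.0])  # assumed tie-break
        self.rewards.append(reward)
        return ranked


buffer = RankedRewardBuffer(capacity=4, percentile=50.0, rng=random.Random(0))
buffer.rewards.extend([0.2, 0.4, 0.6, 0.8])
signal_high = buffer.rank(0.9)  # above the current cutoff
signal_low = buffer.rank(0.3)   # below the updated cutoff
```

Because the cutoff moves with the agent's own recent results, the agent is in effect competing against its past self, which is how a relative metric can stand in for an opponent in a single-player setting.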

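This paper introduces an algorithm called Ranked Reward (R2), which brings adversarial self-play to single-player games. The authors apply it to two-dimensional and three-dimensional bin packing problems, show that it outperforms basic Monte Carlo tree search, heuristic algorithms and integer programming solvers, and analyse the ranked reward mechanism.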