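TL;DR: This work studies self-play algorithms in Markov games, proposes the Value Iteration with Upper/Lower Confidence Bound (VI-ULCB) algorithm and an explore-then-exploit algorithm, and proves that they are effective for policy optimization with high sample efficiency.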
Abstract
Self-play, where the algorithm learns by playing against itself without
requiring any direct supervision, has become the new weapon in modern
reinforcement learning (RL) for achieving superhuman performance in pr