TL;DR: This paper proposes POLY-HOOT, an algorithm that combines continuous-armed bandit strategies with Monte-Carlo Tree Search (MCTS). It augments the HOO algorithm with a polynomial bonus term and analyzes the regret and convergence of the resulting non-stationary bandit problems.
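To make the "polynomial bonus term" concrete, here is a minimal sketch of an arm-selection rule that replaces the usual logarithmic UCB-style bonus with a polynomial one, in the spirit described above. The constants `alpha`, `beta`, and `c` are illustrative placeholders, not the tuned values from the paper, and the scoring function is a simplification of HOO's hierarchical B-values down to a flat bandit.

```python
import math

def poly_bonus_score(mean, n_pulls, t, alpha=0.5, beta=0.5, c=1.0):
    """Score an arm with a polynomial exploration bonus.

    Uses c * t**alpha / n**beta in place of the logarithmic
    sqrt(log t / n) bonus of UCB/UCT. Constants are illustrative.
    """
    if n_pulls == 0:
        # Unpulled arms are explored first.
        return float("inf")
    return mean + c * (t ** alpha) / (n_pulls ** beta)

def select_arm(means, pulls, t):
    """Pick the index of the arm maximizing the polynomial-bonus score."""
    scores = [poly_bonus_score(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: arm 2 has never been pulled, so it is selected first.
print(select_arm([0.2, 0.6, 0.0], [3, 5, 0], t=9))
```

Because the polynomial bonus shrinks more slowly than a logarithmic one, it keeps exploring aggressively enough to handle the non-stationary rewards that arise inside the search tree.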
Abstract
Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environmen