Many important real-world problems have action spaces that are
high-dimensional, continuous or both, making full enumeration of all possible
actions infeasible. Instead, only small subsets of actions can be sampled for
the purpose of policy evaluation and improvement. In this paper, we propose a
general framework to reason in a principled way about policy evaluation and
improvement over such sampled action subsets. This sample-based policy
iteration framework can in principle be applied to any reinforcement learning
algorithm based upon policy iteration. Concretely, we propose Sampled MuZero,
an extension of the MuZero algorithm that is able to learn in domains with
arbitrarily complex action spaces by planning over sampled actions. We
demonstrate this approach on the classical board game of Go and on two
continuous control benchmark domains: DeepMind Control Suite and Real-World RL
Suite.

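This paper proposes a general framework, based on policy iteration, for reasoning about reinforcement learning algorithms when policy evaluation and improvement are carried out over only a small sampled subset of actions. Within this framework, Sampled MuZero is an extension of the MuZero algorithm that can learn in domains with arbitrarily complex action spaces by planning over sampled actions. The authors evaluate the approach empirically on Go, the DeepMind Control Suite, and the Real-World RL Suite.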
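To make the idea of evaluating and improving a policy over a sampled action subset concrete, the following is a minimal toy sketch, not the Sampled MuZero algorithm itself. It assumes a one-dimensional continuous action, a Gaussian proposal standing in for the current policy, a hand-written value function `toy_q`, and a softmax over the sampled actions as the improvement step. All names and parameters here (`sample_actions`, `improved_policy_over_samples`, `toy_q`, `temperature`) are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_actions(rng, mean, std, k):
    """Draw k candidate actions from a Gaussian proposal.

    The Gaussian is an assumed stand-in for the current policy; full
    enumeration of a continuous action space is not possible.
    """
    return [rng.gauss(mean, std) for _ in range(k)]


def improved_policy_over_samples(actions, q_value, temperature=1.0):
    """Return a distribution over the sampled actions only.

    Assumption: a softmax over action values is used as the improvement
    step. Other improvement operators could be substituted.
    """
    scores = [q_value(a) / temperature for a in actions]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def toy_q(action):
    """Hypothetical action-value function with its peak at action = 0.5."""
    return -(action - 0.5) ** 2


rng = random.Random(0)  # fixed seed so the sketch is reproducible
actions = sample_actions(rng, mean=0.0, std=1.0, k=8)
probs = improved_policy_over_samples(actions, toy_q, temperature=0.1)

# The most probable sampled action under the improved distribution.
best = actions[max(range(len(actions)), key=probs.__getitem__)]
print(f"best sampled action: {best:.3f}")
```

The improved distribution is defined only on the eight sampled actions, so its quality depends on how well the proposal covers the high-value region; this is the kind of effect a sample-based framework has to account for.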