Three major challenges in reinforcement learning are complex dynamical
systems with large state spaces, costly data acquisition, and the deviation of
real-world deployment dynamics from the training environment. To
overcome these issues, we study distributionally robust Markov decision
processes with continuous state spaces under the widely used Kullback-Leibler,
chi-square, and total variation uncertainty sets. We propose a model-based
approach that utilizes Gaussian Processes and the maximum variance reduction
algorithm to efficiently learn multi-output nominal transition dynamics,
leveraging access to a generative model (i.e., simulator). We further
establish the statistical sample complexity of the proposed method for
different uncertainty sets. These complexity bounds are independent of the
number of states and extend beyond linear dynamics, ensuring the effectiveness
of our approach in identifying near-optimal distributionally robust policies.
The proposed method can be further combined with other model-free
distributionally robust reinforcement learning methods to obtain a near-optimal
robust policy. Experimental results demonstrate the robustness of our algorithm
to distributional shifts and its superior performance in terms of the number of
samples needed.
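
As a rough illustration of the maximum variance reduction idea mentioned above, the following sketch greedily queries the input whose Gaussian process posterior variance is largest. It is not the authors' implementation: the squared-exponential kernel, the noise level, the lengthscale, and the function names (`rbf_kernel`, `posterior_variance`, `max_variance_reduction`) are all assumptions made for this example, and the candidate set stands in for state-action inputs passed to a simulator. Because the posterior variance depends only on the queried inputs and not on the observed outputs, the same query points could be reused for every output dimension if each dimension is modeled by an independent GP with a shared kernel (again an assumption).

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.5):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)


def posterior_variance(candidates, queried, noise=1e-2, lengthscale=0.5):
    """GP posterior variance at each candidate, given already-queried inputs."""
    prior = np.ones(len(candidates))  # k(x, x) = 1 for this kernel
    if len(queried) == 0:
        return prior
    gram = rbf_kernel(queried, queried, lengthscale) + noise * np.eye(len(queried))
    cross = rbf_kernel(candidates, queried, lengthscale)
    solved = np.linalg.solve(gram, cross.T)  # shape: (n_queried, n_candidates)
    return prior - np.einsum("ij,ji->i", cross, solved)


def max_variance_reduction(candidates, n_queries, noise=1e-2, lengthscale=0.5):
    """Greedily pick the candidate with the largest posterior variance."""
    chosen = []
    for _ in range(n_queries):
        queried = candidates[np.array(chosen, dtype=int)]
        variances = posterior_variance(candidates, queried, noise, lengthscale)
        chosen.append(int(np.argmax(variances)))
    return chosen
```

On a one-dimensional grid such as `np.linspace(0.0, 1.0, 50)[:, None]`, the selected indices spread out over the interval, since each query mostly reduces the uncertainty in its own neighborhood.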

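Summary: Proposes a model-based approach that uses Gaussian processes and the maximum variance reduction algorithm to learn multi-output nominal transition dynamics, addressing several challenges in reinforcement learning. Experiments demonstrate the algorithm's robustness to distributional shift and its advantage in the number of samples required.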

Distributionally Robust Model-based Reinforcement Learning with Large State Spaces

This paper investigates model robustness in reinforcement learning (RL) to
reduce the sim-to-real gap in practice. We adopt the framework of
distributionally robust Markov decision processes (RMDPs), aimed at learning a
policy that optimizes the worst-case performance when the deployed environment
falls within a prescribed uncertainty set around the nominal MDP. Despite
recent efforts, the sample complexity of RMDPs remained mostly unsettled
regardless of the uncertainty set in use. It was unclear if distributional
robustness bears any statistical consequences when benchmarked against standard
RL.
Assuming access to a generative model that draws samples based on the nominal
MDP, we characterize the sample complexity of RMDPs when the uncertainty set is
specified via either the total variation (TV) distance or $\chi^2$ divergence.
The algorithm studied here is a model-based method called {\em distributionally
robust value iteration}, which is shown to be near-optimal for the full range
of uncertainty levels. Somewhat surprisingly, our results uncover that RMDPs
are not necessarily easier or harder to learn than standard MDPs. The
statistical consequence incurred by the robustness requirement depends heavily
on the size and shape of the uncertainty set: in the case w.r.t.~the TV
distance, the minimax sample complexity of RMDPs is always smaller than that of
standard MDPs; in the case w.r.t.~the $\chi^2$ divergence, the sample
complexity of RMDPs can often far exceed the standard MDP counterpart.
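
To make the robust Bellman backup concrete, here is a minimal tabular sketch of value iteration against a TV uncertainty set. It is an illustration under stated assumptions rather than the paper's algorithm as specified: the uncertainty set is taken to be a separate TV ball of radius `sigma` around each nominal row `P[s, a]`, the TV distance is taken as half the L1 norm, and the names `worst_case_expectation_tv` and `robust_value_iteration` are hypothetical. Under these assumptions the inner minimization has a greedy solution: move up to `sigma` probability mass from the highest-value next states onto the lowest-value one. In a generative-model setting, `P` would be an empirical estimate built from samples of the nominal MDP.

```python
import numpy as np


def worst_case_expectation_tv(p, v, sigma):
    """Minimise q @ v over distributions q with 0.5 * ||q - p||_1 <= sigma."""
    q = np.asarray(p, dtype=float).copy()
    lowest = int(np.argmin(v))
    budget = min(sigma, 1.0 - q[lowest])  # mass that can be shifted to `lowest`
    moved = 0.0
    for s in np.argsort(v)[::-1]:  # visit highest-value states first
        if moved >= budget:
            break
        if s == lowest:
            continue
        take = min(q[s], budget - moved)
        q[s] -= take
        moved += take
    q[lowest] += moved
    return float(q @ v)


def robust_value_iteration(P, R, gamma, sigma, n_iters=500):
    """P: (S, A, S) nominal transitions, R: (S, A) rewards. Returns (values, policy)."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(n_iters):
        q_values = np.array([
            [R[s, a] + gamma * worst_case_expectation_tv(P[s, a], v, sigma)
             for a in range(n_actions)]
            for s in range(n_states)
        ])
        v = q_values.max(axis=1)  # greedy improvement against the worst case
    return v, q_values.argmax(axis=1)
```

Setting `sigma=0` recovers standard value iteration on the nominal MDP, and increasing `sigma` can only lower the resulting values, which gives a quick sanity check on any implementation along these lines.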

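Summary: Studies model robustness in reinforcement learning to reduce the sim-to-real gap in practice. Adopting the framework of distributionally robust Markov decision processes, the work learns a policy with optimal performance over a prescribed uncertainty set and analyzes the sample complexity of the model-based distributionally robust value iteration method for different uncertainty sets. The results show that distributionally robust MDPs are not necessarily easier or harder to learn than standard MDPs; the difficulty depends on the size and shape of the uncertainty set.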