Distributionally Robust Reinforcement Learning (DR-RL) aims to derive a
policy optimizing the worst-case performance within a predefined uncertainty
set. Despite extensive research, previous DR-RL algorithms have predominantly
favored model-based approaches, with limited availability of model-free methods
offering convergence guarantees or sample complexities. This paper proposes a
model-free DR-RL algorithm leveraging the Multi-level Monte Carlo (MLMC)
technique to close this gap. Our approach integrates a threshold
mechanism that ensures finite sample requirements for algorithmic
implementation, a significant improvement over previous model-free algorithms.
We develop algorithms for uncertainty sets defined by total variation,
Chi-square divergence, and KL divergence, and provide finite sample analyses
under all three cases. Remarkably, our algorithms represent the first
model-free DR-RL approach featuring finite sample complexity for total
variation and Chi-square divergence uncertainty sets, while also offering an
improved sample complexity and broader applicability compared to existing
model-free DR-RL algorithms for the KL divergence model. The complexities of
our method establish the tightest results for all three uncertainty models in
model-free DR-RL, underscoring the effectiveness and efficiency of our
algorithm, and highlighting its potential for practical applications.
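
As a rough illustration of the idea, the following is a minimal, generic sketch of a truncated MLMC estimator in Python. It is not the paper's algorithm: it assumes the threshold mechanism corresponds to capping the random level at `max_level`, and the names `truncated_mlmc_estimate`, `draw`, `plug_in`, and `soft_min` are hypothetical. Because the level is capped, each call uses at most `2 ** (max_level + 1)` samples, and its expectation equals that of the plug-in estimator built from that many samples.

```python
import math
import random


def truncated_mlmc_estimate(draw, plug_in, max_level, rng, p=0.5):
    """Return one truncated-MLMC estimate of a nonlinear plug-in functional.

    draw(rng) returns one sample; plug_in(samples) maps a non-empty list of
    samples to a number. Both are caller-supplied (hypothetical) callables.
    """
    # Level distribution: geometric weights p * (1 - p)^n, renormalized
    # over 0..max_level (the assumed "threshold").
    weights = [p * (1.0 - p) ** n for n in range(max_level + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    level = rng.choices(range(max_level + 1), weights=probs)[0]

    # Draw 2^(level + 1) samples and compare the full plug-in estimate with
    # the average of the estimates on the even- and odd-indexed halves.
    samples = [draw(rng) for _ in range(2 ** (level + 1))]
    halves = 0.5 * (plug_in(samples[0::2]) + plug_in(samples[1::2]))
    correction = plug_in(samples) - halves

    # Base term from a single sample plus the importance-weighted correction.
    return plug_in(samples[:1]) + correction / probs[level]


def soft_min(samples):
    """Example nonlinear functional: -log of the empirical mean of exp(-x)."""
    return -math.log(sum(math.exp(-x) for x in samples) / len(samples))


rng = random.Random(0)
estimate = truncated_mlmc_estimate(lambda r: r.gauss(1.0, 0.5), soft_min, 4, rng)
```

Averaging many such estimates reduces variance; the geometric parameter `p` trades the expected number of samples per call against the variance of the correction term.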

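This work on distributionally robust reinforcement learning proposes a model-free algorithm that uses the Multi-level Monte Carlo technique to optimize worst-case performance, addressing the limitations of earlier model-free algorithms in convergence guarantees and sample complexity. It provides finite sample analyses under three uncertainty sets and obtains the tightest known complexity results for model-free DR-RL, underscoring the algorithm's effectiveness and efficiency and its potential for practical applications.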

Model-Free Robust Reinforcement Learning with Sample Complexity Analysis

Motivated by the need for a robust policy in the face of environment shifts
between training and deployment, we contribute to the theoretical
foundation of distributionally robust reinforcement learning (DRRL). This is
accomplished through a comprehensive modeling framework centered around
distributionally robust Markov decision processes (DRMDPs). This framework
obliges the decision maker to choose an optimal policy under the worst-case
distributional shift orchestrated by an adversary. By unifying and extending
existing formulations, we rigorously construct DRMDPs that embrace various
modeling attributes for both the decision maker and the adversary. These
attributes include the granularity of adaptability, covering history-dependent,
Markov, and Markov time-homogeneous decision maker and adversary dynamics.
Additionally, we delve into the flexibility of shifts induced by the adversary,
examining SA- and S-rectangularity. Within this DRMDP framework, we investigate
conditions for the existence or absence of the dynamic programming principle
(DPP). From an algorithmic standpoint, the existence of DPP holds significant
implications, as the vast majority of existing data- and computation-efficient
RL algorithms rely on the DPP. To study its existence, we
comprehensively examine combinations of controller and adversary attributes,
providing streamlined proofs grounded in a unified methodology. We also offer
counterexamples for settings in which a DPP with full generality is absent.
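
For orientation, the two rectangularity notions and the dynamic programming equations usually associated with them can be written as below. This is the standard textbook form for a discounted problem, given here as an assumption rather than as the paper's own formulation; the symbols $\mathcal{P}_{s,a}$, $\mathcal{P}_s$, $r$, $\gamma$, $\pi_s$, and $V^*$ are illustrative notation not taken from the abstract.

```latex
% SA-rectangularity: the adversary chooses a transition distribution
% independently for every state-action pair.
\mathcal{P} = \prod_{(s,a) \in \mathcal{S} \times \mathcal{A}} \mathcal{P}_{s,a},
\qquad \mathcal{P}_{s,a} \subseteq \Delta(\mathcal{S})

% S-rectangularity: the adversary chooses independently per state, but its
% choices across actions within one state may be coupled.
\mathcal{P} = \prod_{s \in \mathcal{S}} \mathcal{P}_{s},
\qquad \mathcal{P}_{s} \subseteq \Delta(\mathcal{S})^{|\mathcal{A}|}

% Robust Bellman equation in the SA-rectangular case.
V^*(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a)
  + \gamma \inf_{p \in \mathcal{P}_{s,a}} \sum_{s' \in \mathcal{S}} p(s')\, V^*(s') \Big\}

% Robust Bellman equation in the S-rectangular case, where the decision
% maker may need to randomize over actions.
V^*(s) = \max_{\pi_s \in \Delta(\mathcal{A})} \inf_{p \in \mathcal{P}_{s}}
  \sum_{a \in \mathcal{A}} \pi_s(a) \Big[ r(s,a)
  + \gamma \sum_{s' \in \mathcal{S}} p_a(s')\, V^*(s') \Big]
```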

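Motivated by the need to handle environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). Through a comprehensive modeling framework centered on distributionally robust Markov decision processes (DRMDPs), we rigorously construct models that cover a range of attributes for both the decision maker and the adversary. We also study the flexibility of the shifts induced by the adversary and examine conditions for the existence of the dynamic programming principle. From an algorithmic standpoint, the existence of this principle matters because most existing data- and computation-efficient reinforcement learning algorithms rely on it. We provide streamlined proofs based on a unified methodology, along with counterexamples for settings in which a fully general dynamic programming principle does not exist.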

On the Foundation of Distributionally Robust Reinforcement Learning

We consider the problem of learning a control policy that is robust against
parameter mismatches between the training and testing
environments. We formulate this as a distributionally robust reinforcement
learning (DR-RL) problem where the objective is to learn the policy which
maximizes the value function against the worst possible stochastic model of the
environment in an uncertainty set. We focus on the tabular episodic learning
setting where the algorithm has access to a generative model of the nominal
(training) environment around which the uncertainty set is defined. We propose
the Robust Phased Value Learning (RPVL) algorithm to solve this problem for the
uncertainty sets specified by four different divergences: total variation,
chi-square, Kullback-Leibler, and Wasserstein. We show that our algorithm
achieves $\tilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}| H^{5})$ sample
complexity, which is uniformly better than the existing results by a factor of
$|\mathcal{S}|$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is
the number of actions, and $H$ is the horizon length. We also provide the
first-ever sample complexity result for the Wasserstein uncertainty set.
Finally, we demonstrate the performance of our algorithm using simulation
experiments.
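
To make the worst-case objective concrete, here is a minimal Python sketch of finite-horizon robust backward induction under a total variation uncertainty set. It is not the RPVL algorithm: it assumes the nominal model is known exactly rather than estimated from generative-model samples, that the set is SA-rectangular with the same radius `rho` for every state-action pair, and that total variation is defined as half the L1 distance. The function names and the toy model are hypothetical.

```python
def worst_case_expectation_tv(p, v, rho):
    """Minimize sum(q[s] * v[s]) over distributions q with 0.5 * ||q - p||_1 <= rho."""
    q = list(p)
    lowest = min(range(len(v)), key=lambda s: v[s])
    budget = rho
    # Take mass from the highest-value states first and move it to the
    # lowest-value state until the budget rho is used up.
    for s in sorted(range(len(v)), key=lambda s: v[s], reverse=True):
        if s == lowest or budget <= 0.0:
            break
        moved = min(q[s], budget)
        q[s] -= moved
        q[lowest] += moved
        budget -= moved
    return sum(qs * vs for qs, vs in zip(q, v))


def robust_backward_induction(P, R, H, rho):
    """Return the stage-0 robust values and a greedy policy per stage.

    P[s][a] is the nominal next-state distribution and R[s][a] the reward.
    """
    n_states = len(P)
    V = [0.0] * n_states  # value after the final stage
    policy = []
    for _ in range(H):
        Q = [[R[s][a] + worst_case_expectation_tv(P[s][a], V, rho)
              for a in range(len(P[s]))] for s in range(n_states)]
        policy.append([max(range(len(q)), key=lambda a: q[a]) for q in Q])
        V = [max(q) for q in Q]
    policy.reverse()  # policy[h][s] is the action at stage h in state s
    return V, policy


# Toy model: action 0 moves to either state with equal probability,
# action 1 always moves to state 0; only state 1 pays a reward.
P = [[[0.5, 0.5], [1.0, 0.0]], [[0.5, 0.5], [1.0, 0.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
V0, pi = robust_backward_induction(P, R, H=2, rho=0.2)
```

In this toy model the adversary shifts 0.2 of probability mass from the rewarding state to the other one, so the robust value of action 0 at the first stage is 0.3 instead of the nominal 0.5.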

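This paper proposes a distributionally robust reinforcement learning algorithm, Robust Phased Value Learning, which handles uncertainty sets defined by four different divergence measures and achieves a sample complexity that is uniformly better than existing results.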