Dynamic decision making under distributional shifts is of fundamental
interest in both the theory and applications of reinforcement learning: the
distribution of the environment in which the data is collected can differ from
that of the environment in which the model is deployed. This paper presents two novel
model-free algorithms, namely the distributionally robust Q-learning and its
variance-reduced counterpart, that can effectively learn a robust policy
despite distributional shifts. These algorithms are designed to efficiently
approximate the $q$-function of an infinite-horizon $\gamma$-discounted robust
Markov decision process with a Kullback-Leibler uncertainty set to within an
entry-wise error of $\epsilon$. Further, the variance-reduced distributionally
robust Q-learning combines the synchronous Q-learning with variance-reduction
techniques to enhance its performance. Consequently, we establish that it
attains a minimax sample complexity upper bound of
$\tilde O(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$, where $S$ and $A$ denote the state and
action spaces. This is the first complexity result that is independent of the
uncertainty size $\delta$, thereby providing new complexity-theoretic insights.
Additionally, a series of numerical experiments confirms the theoretical
findings and the efficiency of the algorithms in handling distributional
shifts.
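
To make the role of the Kullback-Leibler uncertainty set and its size $\delta$ concrete, here is a minimal sketch of the worst-case expectation that appears inside a robust Bellman backup. It is not the paper's model-free algorithm: it assumes the nominal transition probabilities are known, and it uses the standard dual form $\sup_{\alpha \ge 0}\{-\alpha \log E_P[e^{-V/\alpha}] - \alpha\delta\}$ of the KL-constrained infimum. The function names `kl_robust_expectation` and `robust_q_backup` and the ternary-search solver are illustrative choices, not taken from the source.

```python
import math


def kl_robust_expectation(p, v, delta, iters=200):
    """Worst-case mean of v over {Q : KL(Q || P) <= delta}.

    Uses the dual form sup_{alpha >= 0} of
        -alpha * log E_P[exp(-V / alpha)] - alpha * delta,
    maximized by ternary search (the dual objective is concave in alpha).
    """
    support = [(pi, vi) for pi, vi in zip(p, v) if pi > 0]
    v_min = min(vi for _, vi in support)
    v_max = max(vi for _, vi in support)
    if v_max == v_min or delta <= 0:
        # Constant V or no uncertainty: the nominal mean is the answer.
        return sum(pi * vi for pi, vi in support)

    def dual(alpha):
        # Shift by v_min so every exponent is <= 0 (no overflow).
        s = sum(pi * math.exp(-(vi - v_min) / alpha) for pi, vi in support)
        return v_min - alpha * math.log(s) - alpha * delta

    # The maximizer satisfies alpha <= (v_max - v_min) / delta.
    lo = 1e-12
    hi = max(lo, (v_max - v_min) / delta)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual(m1) < dual(m2):
            lo = m1
        else:
            hi = m2
    # The dual tends to v_min as alpha -> 0, so v_min is a valid lower bound.
    return max(v_min, dual(0.5 * (lo + hi)))


def robust_q_backup(reward, gamma, p_next, v_next, delta):
    """One robust Bellman backup for a single (state, action) pair,
    assuming a known nominal next-state distribution p_next (hypothetical
    helper; the paper's algorithms are model-free and sample-based)."""
    return reward + gamma * kl_robust_expectation(p_next, v_next, delta)
```

For a nominal distribution that puts probability 0.5 on each of the values 0 and 1, the worst-case expectation decreases from the nominal mean 0.5 toward 0 as $\delta$ grows, and reaches 0 once $\delta \ge \log 2$, since the adversary can then move all mass onto the worse outcome.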

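This paper proposes two model-free algorithms: distributionally robust Q-learning and its variance-reduced variant. These algorithms can effectively learn robust policies under distributional shifts. A series of numerical experiments confirms the theoretical findings and the efficiency of the algorithms.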