We study off-dynamics reinforcement learning (RL), where the policy is
trained on a source domain and deployed to a distinct target domain. We aim to
solve this problem via online distributionally robust Markov decision processes
(DRMDPs), where the learning algorithm actively interacts with the source
domain while seeking optimal performance under the worst-case dynamics
within an uncertainty set around the source domain's transition kernel. We
provide the first study on online DRMDPs with function approximation for
off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity,
even when the nominal transition kernel is linear, leading to error
propagation. By designing a $d$-rectangular uncertainty set using the total
variation distance, we remove this additional nonlinearity and bypass the error
propagation. We then introduce DR-LSVI-UCB, the first provably efficient online
DRMDP algorithm for off-dynamics RL with function approximation, and establish
a polynomial suboptimality bound that is independent of the state and action
space sizes. Our work takes a first step towards a deeper understanding of
the provable efficiency of online DRMDPs with linear function approximation.
Finally, we substantiate the performance and robustness of DR-LSVI-UCB through
numerical experiments.

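We study off-dynamics reinforcement learning, in which a policy is trained on a source domain and deployed to a different target domain. We address this problem through online distributionally robust Markov decision processes (DRMDPs): while interacting with the source domain, the learning algorithm seeks optimal performance under the worst-case dynamics within an uncertainty set around the source domain's transition kernel. We design a $d$-rectangular uncertainty set based on the total variation distance, which resolves the nonlinearity issue in DRMDPs by removing the additional nonlinearity and bypassing error propagation. We introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the sizes of the state and action spaces. Our work is a first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we validate the performance and robustness of DR-LSVI-UCB through numerical experiments.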
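
To make the worst-case objective described above concrete, the following is a minimal sketch of the inner problem only: the smallest expected value of a function over a total variation ball around a nominal distribution on a finite state space. It is not DR-LSVI-UCB, and it does not model the $d$-rectangular structure or function approximation. The function names, the finite state space, and the convention that total variation is half the $\ell_1$ distance are assumptions made for illustration. The second function evaluates a standard dual form for total variation balls, whose truncation `min(v, alpha)` is one place where nonlinearity enters.

```python
import numpy as np


def worst_case_expectation_tv(p0, v, rho):
    """Primal view (hypothetical helper): min over p of E_p[v]
    subject to 0.5 * ||p - p0||_1 <= rho, with p a probability vector.
    The minimizer moves up to rho mass from high-value states to the
    lowest-value state."""
    p = np.asarray(p0, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    lo = int(np.argmin(v))
    # Cannot move more mass than sits outside the lowest-value state.
    budget = min(rho, 1.0 - p[lo])
    for i in np.argsort(-v):  # visit states from highest to lowest value
        if budget <= 0:
            break
        if i == lo:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lo] += moved
        budget -= moved
    return float(p @ v)


def worst_case_expectation_tv_dual(p0, v, rho):
    """Dual view (hypothetical helper):
    max over alpha of E_p0[min(v, alpha)] - rho * (alpha - min_s min(v, alpha)).
    The objective is piecewise linear and concave in alpha with breakpoints
    at the entries of v, so it suffices to check alpha at those entries."""
    p0 = np.asarray(p0, dtype=float)
    v = np.asarray(v, dtype=float)
    best = -np.inf
    for alpha in np.unique(v):
        clipped = np.minimum(v, alpha)
        best = max(best, float(p0 @ clipped) - rho * (alpha - clipped.min()))
    return best


p0 = [0.5, 0.3, 0.2]   # nominal (source-domain) next-state distribution
v = [1.0, 2.0, 4.0]    # value of each next state
nominal = float(np.dot(p0, v))
robust = worst_case_expectation_tv(p0, v, rho=0.3)
robust_dual = worst_case_expectation_tv_dual(p0, v, rho=0.3)
print(nominal, robust, robust_dual)
```

In this toy example the robust value is lower than the nominal expectation, and the primal and dual computations agree. Setting `rho=0` recovers the nominal expectation, which corresponds to ignoring the mismatch between source and target dynamics.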