We study off-dynamics reinforcement learning (RL), where a policy is trained on a source domain and deployed to a distinct target domain. We address this problem through online distributionally robust Markov decision processes (DRMDPs), in which the learning algorithm actively interacts with the source domain while seeking optimal performance under the worst-case dynamics within an uncertainty set around the source domain's transition kernel. We provide the first study of online DRMDPs with function approximation for off-dynamics RL. We find that the dual formulation of DRMDPs can induce nonlinearity even when the nominal transition kernel is linear, which leads to error propagation. By designing a $d$-rectangular uncertainty set based on the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the sizes of the state and action spaces. Our work takes a first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through numerical experiments.
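To make the "worst-case dynamics within an uncertainty set" concrete, the following is a minimal sketch, not DR-LSVI-UCB itself, of the inner robust expectation over a total variation ball in a finite state space. It assumes the total variation distance is normalized as half the $\ell_1$ distance and that the radius `rho` lies in $[0, 1]$; the function names `worst_case_primal` and `worst_case_dual` are hypothetical and chosen only for illustration. The first function shifts probability mass directly, and the second evaluates the standard dual form for total variation balls, $\max_{\alpha} \{ \mathbb{E}_{q}[\min(V, \alpha)] - \rho\,(\alpha - \min_s V(s)) \}$. The truncation $\min(V, \alpha)$ in the dual is one way to see how nonlinearity in the value function can arise.

```python
import numpy as np


def worst_case_primal(q, v, rho):
    """Minimize E_p[v] over {p : 0.5 * ||p - q||_1 <= rho} by greedy mass transfer.

    Illustrative assumption: finite state space, TV defined as half the L1 distance.
    """
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    p = q.copy()
    lowest = int(np.argmin(v))   # state that receives the shifted mass
    budget = float(rho)          # total mass that may be moved
    for i in np.argsort(-v):     # visit states from highest to lowest value
        if i == lowest or budget <= 0.0:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[lowest] += moved
        budget -= moved
    return float(p @ v)


def worst_case_dual(q, v, rho):
    """Evaluate max_alpha { E_q[min(v, alpha)] - rho * (alpha - min(v)) }.

    The objective is concave and piecewise linear in alpha with breakpoints at
    the entries of v, so it suffices to search over those entries.
    """
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    v_min = v.min()
    candidates = np.unique(v)
    values = [q @ np.minimum(v, a) - rho * (a - v_min) for a in candidates]
    return float(max(values))


if __name__ == "__main__":
    q = [0.5, 0.3, 0.2]   # nominal next-state distribution (made-up numbers)
    v = [0.0, 1.0, 2.0]   # value of each next state (made-up numbers)
    for rho in (0.0, 0.1, 0.3):
        print(rho, worst_case_primal(q, v, rho), worst_case_dual(q, v, rho))
```

Both functions should agree up to floating-point error, which gives a quick sanity check of the dual form on small examples. In the $d$-rectangular setting described above, an uncertainty set of this kind would be placed around each factor of the linear transition kernel rather than around each state-action pair; that extension is not shown here.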

Distributionally Robust Off-Dynamics Reinforcement Learning: Provable Efficiency with Linear Function Approximation