We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation in a drifting environment. Specifically, both the reward and state transition functions can evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain \textit{variation budgets}. We first develop the \texttt{LSVI-UCB-Restart} algorithm, an optimistic modification of least-squares value iteration combined with periodic restarts, and establish its dynamic regret bound when the variation budgets are known. We then propose a parameter-free algorithm, \texttt{Ada-LSVI-UCB-Restart}, that works without knowledge of the variation budgets, at the cost of a slightly worse dynamic regret bound. We also derive the first minimax dynamic regret lower bound for nonstationary MDPs, which shows that our proposed algorithms are near-optimal. As a byproduct, we establish a minimax regret lower bound for linear MDPs, a problem left unsolved by \cite{jin2020provably}. As far as we know, this is the first dynamic regret analysis in nonstationary reinforcement learning with function approximation.
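
As a hedged illustration of the quantities named in the abstract, one way to write the variation budgets and the dynamic regret is sketched here. The notation is an assumption introduced for this sketch and does not come from the abstract: $K$ episodes of horizon $H$, reward parameters $\theta_{k,h}$ and transition measures $\mu_{k,h}$ of the linear MDP in episode $k$ at step $h$, budgets $B_\theta$ and $B_\mu$, initial state $s_1^k$, and a norm $\|\cdot\|$ on the transition measures whose exact choice is left unspecified.

\[
\sum_{k=2}^{K} \sum_{h=1}^{H} \left\| \theta_{k,h} - \theta_{k-1,h} \right\|_2 \le B_\theta,
\qquad
\sum_{k=2}^{K} \sum_{h=1}^{H} \left\| \mu_{k,h} - \mu_{k-1,h} \right\| \le B_\mu,
\]
\[
\text{Dyn-Reg}(K) = \sum_{k=1}^{K} \left[ V^{*}_{k,1}(s_1^k) - V^{\pi_k}_{k,1}(s_1^k) \right].
\]

Under this reading, $V^{*}_{k,1}$ is the optimal value function of the MDP in effect in episode $k$ and $\pi_k$ is the policy the learner plays in that episode, so regret is measured against a comparator that changes with the environment rather than against a single fixed policy.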

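This work studies reinforcement learning in Markov decision processes with linear function approximation, where the variation of the reward and state transition functions, measured by suitable metrics, is kept within a fixed budget. It proposes two near-optimal algorithms, LSVI-UCB-Restart and Ada-LSVI-UCB-Restart. It also provides theoretical support for dynamic regret analysis in nonstationary MDPs and linear MDPs, and validates the effectiveness of the algorithms.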

Nonstationary Reinforcement Learning with Linear Function Approximation