General function approximation is a powerful tool to handle large state and
action spaces in a broad range of reinforcement learning (RL) scenarios.
However, theoretical understanding of non-stationary MDPs with general function
approximation is still limited. In this paper, we make the first such
attempt. We first propose a new complexity metric called the dynamic Bellman
Eluder (DBE) dimension for non-stationary MDPs, which subsumes the majority of
existing tractable RL problems in static MDPs as well as non-stationary MDPs. Based on
the proposed complexity metric, we propose a novel confidence-set based
model-free algorithm called SW-OPEA, which features a sliding window mechanism
and a new confidence set design for non-stationary MDPs. We then establish an
upper bound on the dynamic regret for the proposed algorithm, and show that
SW-OPEA is provably efficient as long as the variation budget is not
too large. We further demonstrate via examples of non-stationary
linear and tabular MDPs that our algorithm performs better in the small
variation budget regime than existing UCB-type algorithms. To the best of our
knowledge, this is the first dynamic regret analysis in non-stationary MDPs
with general function approximation.
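
The sliding window mechanism mentioned above can be illustrated with a minimal sketch. The abstract does not describe how SW-OPEA builds its confidence set, so the code below shows only the generic idea of keeping the most recent episodes and discarding older ones. The class name `SlidingWindowBuffer`, its methods, and the window length are hypothetical and not taken from the paper.

```python
from collections import deque


class SlidingWindowBuffer:
    """Keep only the most recent `window` episodes (hypothetical helper).

    Illustrates the generic sliding-window idea: data older than `window`
    episodes is dropped, so estimates rely on recent, less stale samples.
    This is NOT the paper's confidence-set construction.
    """

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be a positive integer")
        # A deque with maxlen evicts the oldest entry automatically.
        self._episodes = deque(maxlen=window)

    def add(self, episode_index, trajectory):
        """Store one episode's trajectory, evicting the oldest if full."""
        self._episodes.append((episode_index, trajectory))

    def indices(self):
        """Return the episode indices currently inside the window."""
        return [k for k, _ in self._episodes]

    def trajectories(self):
        """Return the trajectories currently inside the window."""
        return [traj for _, traj in self._episodes]


if __name__ == "__main__":
    buf = SlidingWindowBuffer(window=3)
    for k in range(1, 6):
        # Placeholder trajectory: a list of (state, action, reward) tuples.
        buf.add(k, [("s0", "a0", 0.0)])
    print(buf.indices())
```

In such a scheme the window length trades off the amount of data against its staleness: a shorter window tracks faster drift, at the cost of fewer samples per estimate.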

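This paper addresses non-stationary MDPs by proposing a complexity metric, the dynamic Bellman Eluder (DBE) dimension, and a new confidence-set based algorithm, SW-OPEA. Examples of non-stationary linear and tabular MDPs show that the algorithm outperforms existing UCB-type algorithms in the small variation budget regime, and the analysis shows that SW-OPEA is provably efficient as long as the variation budget is not significantly large.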

Non-stationary Reinforcement Learning under General Function Approximation

We consider reinforcement learning (RL) in episodic Markov decision processes
(MDPs) with linear function approximation under a drifting environment.
Specifically, both the reward and state transition functions can evolve over
time, as long as their respective total variations, quantified by suitable
metrics, do not exceed certain \textit{variation budgets}. We first develop the
$\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of
least-squares value iteration combined with periodic restart, and establish its
dynamic regret bound when variation budgets are known. We then propose a
parameter-free algorithm, $\texttt{Ada-LSVI-UCB-Restart}$, that works without
knowing the variation budgets, but with a slightly worse dynamic regret bound.
We also derive the first minimax dynamic regret lower bound for nonstationary
MDPs to show that our proposed algorithms are near-optimal. As a byproduct, we
establish a minimax regret lower bound for linear MDPs, which was left open by
\cite{jin2020provably}. In addition, we provide numerical experiments to
demonstrate the effectiveness of our proposed algorithms. As far as we know,
this is the first dynamic regret analysis in nonstationary reinforcement
learning with function approximation.
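
For orientation, the two quantities the abstract relies on can be written out as follows. This is a sketch in commonly used notation, not the paper's own definitions: the symbols $K$ (number of episodes), $H$ (horizon), $r_{k,h}$ and $P_{k,h}$ (reward and transition at step $h$ of episode $k$), $\pi^k$ (the policy played in episode $k$), and the unspecified norms are assumptions, since the abstract only says the variations are "quantified by suitable metrics".

```latex
% Dynamic regret: the learner is compared, in every episode k, against the
% optimal policy of that episode's MDP rather than a single fixed policy.
\[
  \text{D-Regret}(K)
  = \sum_{k=1}^{K} \Bigl[ V^{*}_{k,1}(s^{k}_{1}) - V^{\pi^{k}}_{k,1}(s^{k}_{1}) \Bigr]
\]

% Variation budgets: total drift of rewards and transitions across episodes,
% measured in norms left unspecified here (an assumption).
\[
  \sum_{k=2}^{K} \sum_{h=1}^{H} \bigl\| r_{k,h} - r_{k-1,h} \bigr\| \le B_{r},
  \qquad
  \sum_{k=2}^{K} \sum_{h=1}^{H} \bigl\| P_{k,h} - P_{k-1,h} \bigr\| \le B_{P}
\]
```

When $B_{r} = B_{P} = 0$ the environment is stationary and the dynamic regret reduces to the usual static regret.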

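This work studies reinforcement learning in Markov decision processes with linear function approximation, where the total variations of the reward and state transition functions, measured by suitable metrics, stay within given variation budgets. It proposes two near-optimal algorithms, LSVI-UCB-Restart and Ada-LSVI-UCB-Restart, provides a dynamic regret analysis for nonstationary MDPs together with a regret lower bound for linear MDPs, and validates the algorithms with numerical experiments.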