We explore online learning in episodic loop-free Markov decision processes in
non-stationary environments (changing losses and transition probabilities). Our
focus is on the Concave Utility Reinforcement Learning problem (CURL), an
extension of classical RL for handling convex performance criteria in
state-action distributions induced by agent policies. While various machine
learning problems can be written as CURL, its non-linearity invalidates
traditional Bellman equations. Despite recent solutions to classical CURL, none
address non-stationary MDPs. This paper introduces MetaCURL, the first CURL
algorithm for non-stationary MDPs. It employs a meta-algorithm that runs multiple
black-box algorithm instances over different intervals, aggregating their outputs
via a sleeping expert framework. The key hurdle is partial information due to
MDP uncertainty. Under partial information on the transition probabilities
(uncertainty and non-stationarity coming only from external noise, independent
of agent state-action pairs), we achieve optimal dynamic regret without prior
knowledge of MDP changes. Unlike existing approaches for RL, MetaCURL handles
fully adversarial losses, not just stochastic ones. We believe our approach to
managing non-stationarity with experts can be of interest to the RL community.
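
To make the sleeping-expert aggregation mentioned above concrete, here is a minimal sketch of one generic sleeping-experts round using exponential weights. It is not the MetaCURL algorithm itself: the function names, the learning rate `eta`, and the representation of experts as `(start, end)` intervals are assumptions made for illustration only. Experts that are asleep in a round keep their weight, and the total weight of the awake experts is preserved by the update.

```python
import math


def awake_experts(intervals, t):
    """Return indices of experts whose (start, end) interval contains round t.

    The interval representation is a hypothetical choice for this sketch.
    """
    return {i for i, (start, end) in enumerate(intervals) if start <= t <= end}


def sleeping_experts_round(weights, awake, losses, eta=0.5):
    """One round of exponential weights restricted to awake experts.

    weights: dict expert_id -> positive weight
    awake:   non-empty set of expert ids active this round
    losses:  dict expert_id -> loss for each awake expert
    Returns (probs, new_weights), where probs is the mixing distribution
    used this round and new_weights are the weights for the next round.
    """
    total = sum(weights[i] for i in awake)
    # Mixing distribution over awake experts only.
    probs = {i: weights[i] / total for i in awake}
    # Exponential update for awake experts.
    updated = {i: weights[i] * math.exp(-eta * losses[i]) for i in awake}
    # Rescale so the awake experts' total weight is unchanged;
    # sleeping experts keep their previous weight.
    scale = total / sum(updated.values())
    new_weights = dict(weights)
    for i in awake:
        new_weights[i] = updated[i] * scale
    return probs, new_weights
```

For example, with three experts on intervals `[(1, 4), (1, 2), (3, 4)]`, only the first two are awake at round `t = 1`, so the third expert's weight is untouched until its interval begins.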

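Using a meta-algorithm and expert aggregation, we study online learning in episodic loop-free Markov decision processes in non-stationary environments (changing losses and transition probabilities), focusing on CURL, an extension of classical reinforcement learning that handles convex performance criteria. Under partial information and without prior knowledge of MDP changes, our method achieves optimal dynamic regret and handles fully adversarial losses, not just stochastic ones. We believe our approach to managing non-stationarity with experts can be of interest to the reinforcement learning community.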

MetaCURL: Non-stationary Concave Utility Reinforcement Learning

General function approximation is a powerful tool to handle large state and
action spaces in a broad range of reinforcement learning (RL) scenarios.
However, theoretical understanding of non-stationary MDPs with general function
approximation is still limited. In this paper, we make the first such
attempt. We first propose a new complexity metric called dynamic Bellman Eluder
(DBE) dimension for non-stationary MDPs, which subsumes the majority of existing
tractable RL problems in static MDPs as well as non-stationary MDPs. Based on
the proposed complexity metric, we propose a novel confidence-set-based
model-free algorithm called SW-OPEA, which features a sliding window mechanism
and a new confidence set design for non-stationary MDPs. We then establish an
upper bound on the dynamic regret for the proposed algorithm, and show that
SW-OPEA is provably efficient as long as the variation budget is not
significantly large. We further demonstrate via examples of non-stationary
linear and tabular MDPs that our algorithm performs better in the small variation
budget scenario than existing UCB-type algorithms. To the best of our
knowledge, this is the first dynamic regret analysis in non-stationary MDPs
with general function approximation.
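
The sliding-window idea mentioned above can be illustrated with a minimal sketch: an estimator that uses only the most recent observations, so that stale data from before an environment change is eventually discarded. This is not SW-OPEA, which builds confidence sets over a function class; the class name `SlidingWindowMean` and the parameter `window` are hypothetical and serve only to show the windowing mechanism.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the last `window` observations (illustrative sketch only)."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be a positive integer")
        # A bounded deque drops the oldest observation automatically.
        self.buf = deque(maxlen=window)

    def update(self, x):
        """Record a new observation."""
        self.buf.append(x)

    def mean(self):
        """Return the average over the current window."""
        if not self.buf:
            raise ValueError("no observations yet")
        return sum(self.buf) / len(self.buf)
```

If the observed quantity shifts from 0 to 1 and the window is no longer than the time since the shift, the windowed mean reflects only the new value, whereas a full-history average would still be pulled toward the old one. A shorter window adapts faster but averages over fewer samples.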

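This paper addresses non-stationary MDPs, proposing a complexity metric, the dynamic Bellman Eluder dimension, and a new confidence-set-based algorithm, SW-OPEA. Examples on non-stationary linear and tabular MDPs show that the algorithm outperforms existing UCB-type algorithms in the small variation budget scenario, and the paper proves that SW-OPEA is efficient as long as the variation budget is not significantly large.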