We explore online learning in episodic loop-free Markov decision processes in
non-stationary environments (changing losses and transition probabilities). Our
focus is on the Concave Utility Reinforcement Learning problem (CURL), an
extension of classical RL for handling convex performance criteria in
state-action distributions induced by agent policies. While various machine
learning problems can be written as CURL, its non-linearity invalidates
traditional Bellman equations. Despite recent solutions to classical CURL, none
address non-stationary MDPs. This paper introduces MetaCURL, the first CURL
algorithm for non-stationary MDPs. It employs a meta-algorithm running multiple
black-box algorithm instances over different intervals, aggregating outputs
via a sleeping expert framework. The key hurdle is partial information due to
MDP uncertainty. Under partial information on the probability transitions
(uncertainty and non-stationarity coming only from external noise, independent
of agent state-action pairs), we achieve optimal dynamic regret without prior
knowledge of MDP changes. Unlike approaches for RL, MetaCURL handles fully
adversarial losses, not just stochastic ones. We believe our approach to
managing non-stationarity with experts can be of interest to the RL community.
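
To make the aggregation step concrete, here is a minimal sketch of a sleeping-experts scheme in which only the currently active ("awake") experts contribute to the output and receive weight updates. It is an illustration under assumptions, not MetaCURL itself: the function names, the scalar predictions, and the particular exponential update are hypothetical choices, and the paper's actual aggregation rule may differ.

```python
import math

def sleeping_experts_aggregate(weights, awake, predictions):
    """Weighted average of the predictions of the awake experts only."""
    total = sum(w for w, a in zip(weights, awake) if a)
    if total == 0:
        raise ValueError("at least one awake expert with positive weight is required")
    return sum(w * p for w, a, p in zip(weights, awake, predictions) if a) / total

def sleeping_experts_update(weights, awake, losses, agg_loss, eta):
    """Exponential update relative to the aggregate's loss (one common variant).

    Awake experts that did better than the aggregate gain weight, those that
    did worse lose weight, and asleep experts keep their weight unchanged.
    """
    return [
        w * math.exp(-eta * (loss - agg_loss)) if a else w
        for w, a, loss in zip(weights, awake, losses)
    ]

# Hypothetical round: three experts, the third one is asleep.
weights = [1.0, 1.0, 1.0]
awake = [True, True, False]
predictions = [0.0, 1.0, 5.0]
output = sleeping_experts_aggregate(weights, awake, predictions)

# Assumed losses for this round; the asleep expert's loss is ignored.
losses = [0.0, 1.0, 9.0]
new_weights = sleeping_experts_update(weights, awake, losses, agg_loss=0.5, eta=1.0)
```

In a meta-algorithm of the kind described above, each expert would be a black-box instance started at a different time, and an expert would be asleep before its interval begins.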

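Using a meta-algorithm with expert aggregation, we study online learning in episodic loop-free Markov decision processes in non-stationary environments (changing losses and transition probabilities), focusing on CURL, an extension of classical reinforcement learning that handles convex performance criteria. Under partial information and without prior knowledge of MDP changes, our method achieves optimal dynamic regret and handles fully adversarial losses rather than only stochastic ones. We believe our approach to managing non-stationarity with experts can be of interest to the reinforcement learning community.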

MetaCURL: Non-stationary Concave Utility Reinforcement Learning

We consider inverse reinforcement learning problems with concave utilities.
Concave Utility Reinforcement Learning (CURL) is a generalisation of the
standard RL objective, which employs a concave function of the state occupancy
measure, rather than a linear function. CURL has garnered recent attention for
its ability to represent standard RL as well as many other important
applications, such as imitation learning, pure exploration, constrained MDPs,
offline RL, human-regularized RL, and others. Inverse reinforcement learning is
a powerful paradigm that focuses on recovering an unknown reward function that
can rationalize the observed behaviour of an agent. There have been recent
theoretical advances in inverse RL where the problem is formulated as
identifying the set of feasible reward functions. However, inverse RL for CURL
problems has not been considered previously. In this paper we show that most of
the standard IRL results do not apply to CURL in general, since CURL
invalidates the classical Bellman equations. This calls for a new theoretical
framework for the inverse CURL problem. Using a recent equivalence result
between CURL and Mean-field Games, we propose a new definition for the feasible
rewards for I-CURL by proving that this problem is equivalent to an inverse
game theory problem in a subclass of mean-field games. We present initial query
and sample complexity results for the I-CURL problem under assumptions such as
Lipschitz-continuity. Finally, we outline future directions and applications in
human--AI collaboration enabled by our results.
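
For reference, the contrast between the standard RL objective and the CURL objective described above can be written as follows. The notation is assumed for illustration and is not taken from the paper: $\mu^{\pi}$ denotes the state occupancy measure induced by policy $\pi$, $r$ a reward function, and $F$ the concave utility.

```latex
\begin{align*}
\text{standard RL:} \quad & \max_{\pi} \; \langle r, \mu^{\pi} \rangle
    = \max_{\pi} \sum_{s} r(s)\, \mu^{\pi}(s), \\
\text{CURL:} \quad & \max_{\pi} \; F(\mu^{\pi}), \qquad F \text{ concave}.
\end{align*}
```

Choosing $F(\mu) = \langle r, \mu \rangle$ recovers standard RL, which is why the linear case is a special instance of CURL.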

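We propose a new theoretical framework for the inverse reinforcement learning problem with concave utilities (CURL) and show that it is equivalent to an inverse game theory problem in mean-field games, revealing the properties and challenges that distinguish inverse CURL from traditional inverse reinforcement learning.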

Inverse Concave-Utility Reinforcement Learning is Inverse Game Theory

Concave Utility Reinforcement Learning (CURL) extends RL from linear to
concave utilities in the occupancy measure induced by the agent's policy. This
encompasses not only RL but also imitation learning and exploration, among
others. Yet, this more general paradigm invalidates the classical Bellman
equations, and calls for new algorithms. Mean-field Games (MFGs) are a
continuous approximation of many-agent RL. They consider the limit case of a
continuous distribution of identical agents, anonymous with symmetric
interests, and reduce the problem to the study of a single representative agent
in interaction with the full population. Our core contribution consists in
showing that CURL is a subclass of MFGs. We think this is important for
bridging the two communities. It also sheds light on aspects of both
fields: we show the equivalence between concavity in CURL and monotonicity in
the associated MFG, and between optimality conditions in CURL and Nash
equilibrium in MFG, and that Fictitious Play (FP) for this class of MFGs is simply
Frank-Wolfe, bringing the first convergence rate for discrete-time FP for MFGs.
We also experimentally demonstrate that, using algorithms recently introduced
for solving MFGs, we can address the CURL problem more efficiently.
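
The Frank-Wolfe view mentioned above can be sketched on a toy problem. The code below is an illustration under assumptions, not the paper's algorithm: it assumes a tabular finite-horizon MDP with known transitions `P[s, a, s']`, writes the problem as minimising a convex `F` (equivalently, maximising the concave utility `-F`), and uses a quadratic objective chosen only for the example. All function names and the two-state MDP are hypothetical. Each iteration solves a standard MDP whose cost is the gradient of `F` at the current occupancy measure, then averages the resulting occupancy measure into the iterate.

```python
import numpy as np

def occupancy(policy, P, mu0):
    """State-action occupancy mu[n, s, a] induced by a time-dependent policy."""
    N, S, A = policy.shape
    mu = np.zeros((N, S, A))
    rho = mu0.copy()
    for n in range(N):
        mu[n] = rho[:, None] * policy[n]
        rho = np.einsum("sa,sat->t", mu[n], P)  # next-step state distribution
    return mu

def best_response(cost, P):
    """Deterministic policy minimising the linear cost <cost, mu> (backward induction)."""
    N, S, A = cost.shape
    policy = np.zeros((N, S, A))
    V = np.zeros(S)
    for n in reversed(range(N)):
        Q = cost[n] + P @ V  # Q[s, a] = immediate cost + expected cost-to-go
        policy[n, np.arange(S), Q.argmin(axis=1)] = 1.0
        V = Q.min(axis=1)
    return policy

def frank_wolfe_curl(grad_F, P, mu0, horizon, iters=500):
    """Frank-Wolfe over occupancy measures with step size 2 / (k + 2)."""
    S, A, _ = P.shape
    mu = occupancy(np.full((horizon, S, A), 1.0 / A), P, mu0)
    for k in range(iters):
        pi = best_response(grad_F(mu), P)   # linear minimisation step = solve an MDP
        gamma = 2.0 / (k + 2.0)
        mu = (1.0 - gamma) * mu + gamma * occupancy(pi, P, mu0)
    return mu

# Toy example (hypothetical): 2 states; action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
mu0 = np.array([1.0, 0.0])
target = np.array([0.5, 0.5])  # desired state distribution at every step

def F(mu):
    """Squared distance between the state marginals of mu and the target."""
    return 0.5 * np.sum((mu.sum(axis=2) - target) ** 2)

def grad_F(mu):
    """Gradient of F with respect to mu[n, s, a]; it does not depend on a."""
    diff = mu.sum(axis=2) - target
    return np.broadcast_to(diff[:, :, None], mu.shape)

mu_fw = frank_wolfe_curl(grad_F, P, mu0, horizon=3)
```

Because the set of occupancy measures is convex, each averaged iterate is itself a valid occupancy measure. In this toy problem the initial state is fixed, so the first step always contributes 0.25 to `F` and the iterates can only approach that value from above.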

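This study presents CURL, a reinforcement learning model based on concave utility functions, which extends RL from linear to concave utilities and encompasses imitation learning and exploration, among others. The model invalidates the classical Bellman equations and calls for new algorithms. By proving that CURL is a subclass of MFGs, the paper bridges the two communities, and it shows experimentally that algorithms recently introduced for solving MFGs can address the CURL problem more efficiently.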