Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL research devoted to episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and use it to introduce Reward-Extended Differential (RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including provably convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully online manner, without the use of an explicit bi-level optimization scheme or an augmented state space.
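For readers unfamiliar with CVaR, the following is a minimal sketch of the risk measure itself, computed on a fixed batch of reward samples. It is not the RED algorithm and is not taken from the paper. It assumes the lower-tail convention for rewards (CVaR at level `tau` is the mean of the worst `tau` fraction of rewards) and compares that with the standard Rockafellar-Uryasev variational form, max over b of b - E[(b - R)^+] / tau. The function names and the sample data are hypothetical.

```python
def cvar_tail_mean(rewards, tau):
    """Mean of the worst tau-fraction of rewards (lower-tail CVaR).

    Assumes tau * len(rewards) is close to an integer; otherwise the
    tail size is rounded, which makes this only an approximation.
    """
    ordered = sorted(rewards)
    k = max(1, round(tau * len(ordered)))  # number of samples in the tail
    return sum(ordered[:k]) / k


def cvar_rockafellar_uryasev(rewards, tau):
    """Lower-tail CVaR via max over b of b - E[(b - R)^+] / tau.

    The empirical objective is concave and piecewise linear in b, with
    breakpoints at the sample values, so searching over the samples is
    enough to find its maximum.
    """
    n = len(rewards)

    def objective(b):
        shortfall = sum(max(b - r, 0.0) for r in rewards) / n
        return b - shortfall / tau

    return max(objective(b) for b in set(rewards))


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical rewards
    print(cvar_tail_mean(samples, 0.2))
    print(cvar_rockafellar_uryasev(samples, 0.2))
```

The inner maximization over `b` is the kind of auxiliary problem that usually leads to a bi-level scheme when CVaR is optimized over policies; the abstract states that the RED approach avoids making that scheme explicit.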

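This work addresses the relative neglect of average-reward Markov decision processes in reinforcement learning. It introduces the Reward-Extended Differential (RED) reinforcement learning framework, together with algorithms that can effectively solve multiple subtasks simultaneously. The study shows that these algorithms can, for the first time, optimize the conditional value-at-risk (CVaR) risk measure in a fully online manner, which gives them significant potential for applications.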

Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes