We study risk-sensitive reinforcement learning (RL) based on an entropic risk
measure in episodic non-stationary Markov decision processes (MDPs). Both the
reward functions and the state transition kernels are unknown and allowed to
vary arbitrarily over time, subject to a budget on their cumulative variation. When
this variation budget is known a priori, we propose two restart-based
algorithms, namely Restart-RSMB and Restart-RSQ, and establish dynamic regret
bounds for them. Based on these results, we further present a meta-algorithm that does
not require any prior knowledge of the variation budget and can adaptively
detect the non-stationarity on the exponential value functions. A dynamic
regret lower bound is then established for non-stationary risk-sensitive RL to
certify the near-optimality of the proposed algorithms. Our results also show
that risk control and the handling of non-stationarity can be
designed separately in the algorithm if the variation budget is known a priori,
while the non-stationarity detection mechanism in the adaptive algorithm depends
on the risk parameter. This work offers the first non-asymptotic theoretical
analysis of non-stationary risk-sensitive RL in the literature.
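
The restart idea can be illustrated with a minimal sketch. It is not the authors' Restart-RSMB or Restart-RSQ: the abstract does not give their details, so the epoch length `epoch_len` and the callables `make_learner` and `run_episode` are hypothetical placeholders. The sketch only shows the generic mechanism of discarding a base learner's estimates at fixed intervals so that stale data from a drifted environment is forgotten.

```python
from typing import Any, Callable, List


def restart_schedule(num_episodes: int, epoch_len: int) -> List[range]:
    """Split episodes 0..num_episodes-1 into consecutive epochs of at most epoch_len."""
    if num_episodes < 0 or epoch_len <= 0:
        raise ValueError("num_episodes must be >= 0 and epoch_len must be > 0")
    return [
        range(start, min(start + epoch_len, num_episodes))
        for start in range(0, num_episodes, epoch_len)
    ]


def run_with_restarts(
    make_learner: Callable[[], Any],          # hypothetical: builds a fresh base learner
    run_episode: Callable[[Any, int], Any],   # hypothetical: runs episode k with the learner
    num_episodes: int,
    epoch_len: int,
) -> List[Any]:
    """Run a base learner, discarding its state at the start of every epoch."""
    results = []
    for epoch in restart_schedule(num_episodes, epoch_len):
        learner = make_learner()  # fresh estimates: earlier data may be stale
        for k in epoch:
            results.append(run_episode(learner, k))
    return results
```

In this sketch the epoch length is the only place where knowledge of the variation budget would enter, while any risk-sensitive logic lives inside the base learner, which is one way to read the separation described in the abstract.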

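This work studies risk-sensitive reinforcement learning with the entropic risk measure in non-stationary finite Markov decision processes. It proposes two restart-based algorithms and a meta-algorithm that adaptively detects non-stationarity, and it establishes a dynamic regret lower bound. The work provides the first non-asymptotic theoretical analysis of non-stationary risk-sensitive reinforcement learning in the literature.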

Non-stationary Risk-sensitive Reinforcement Learning: Near-optimal Dynamic Regret, Adaptive Detection, and Separation Design

We study the regret guarantee for risk-sensitive reinforcement learning
(RSRL) via distributional reinforcement learning (DRL) methods. In particular,
we consider finite episodic Markov decision processes whose objective is the
entropic risk measure (EntRM) of return. We identify a key property of the
EntRM, the monotonicity-preserving property, which enables the risk-sensitive
distributional dynamic programming framework. We then propose two novel DRL
algorithms that implement optimism through two different schemes, one
model-free and one model-based.
We prove that both attain a regret upper bound of
$\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|H}H\sqrt{HS^2AT})$, where $S$ is the number of
states, $A$ the number of actions, $H$ the time horizon, and $T$ the total
number of time steps. This matches the bound of RSVI2 proposed in \cite{fei2021exponential} with a
much simpler regret analysis. To the best of our knowledge, this is the first
regret analysis of DRL, which bridges DRL and RSRL in terms of sample
complexity. Finally, we improve the existing lower bound by proving a tighter
bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the case
$\beta>0$, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the
risk-neutral setting.
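
The risk-neutral recovery can be checked with a short derivation. It assumes the standard definition of the entropic risk measure of a bounded return $X$ with risk parameter $\beta \neq 0$. The abstract does not spell this definition out, so the symbol $U_\beta$ is an assumption introduced here. The derivation takes the limit $\beta \to 0$ of the prefactors in the two bounds stated above.

```latex
% Entropic risk measure (assumed notation U_\beta) and its risk-neutral limit
\[
U_\beta(X) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta X}\right],
\qquad
\lim_{\beta\to 0} U_\beta(X) = \mathbb{E}[X].
\]
% Lower-bound prefactor: use e^{x} - 1 = x + O(x^2) with x = \beta H / 6
\[
\lim_{\beta\to 0^+}\frac{\exp(\beta H/6)-1}{\beta H} = \frac{1}{6}
\quad\Longrightarrow\quad
\Omega\left(\frac{\exp(\beta H/6)-1}{\beta H}\,H\sqrt{SAT}\right)
\to \Omega\left(H\sqrt{SAT}\right).
\]
% Upper-bound prefactor: the same expansion with x = |\beta| H
\[
\lim_{\beta\to 0}\frac{\exp(|\beta| H)-1}{|\beta| H} = 1
\quad\Longrightarrow\quad
\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta| H}\,H\sqrt{HS^2AT}\right)
\to \tilde{\mathcal{O}}\left(H\sqrt{HS^2AT}\right).
\]
```

The constant $1/6$ is absorbed by the $\Omega(\cdot)$ notation, which is why the stated lower bound reduces to $\Omega(H\sqrt{SAT})$ in the limit.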

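This work studies regret guarantees for risk-sensitive reinforcement learning via distributional reinforcement learning methods, proposes two novel DRL algorithms, and bridges DRL and RSRL in terms of sample complexity. It also improves the existing lower bound by proving a tighter one.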

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

We study risk-sensitive reinforcement learning (RL) based on the entropic
risk measure. Although existing works have established non-asymptotic regret
guarantees for this problem, they leave open an exponential gap between the
upper and lower bounds. We identify the deficiencies in existing algorithms and
their analysis that result in such a gap. To remedy these deficiencies, we
investigate a simple transformation of the risk-sensitive Bellman equations,
which we call the exponential Bellman equation. The exponential Bellman
equation inspires us to develop a novel analysis of Bellman backup procedures
in risk-sensitive RL algorithms, and further motivates the design of a novel
exploration mechanism. We show that these analytic and algorithmic innovations
together lead to improved regret upper bounds over existing ones.
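
For orientation, the transformation can be sketched as follows. The notation is assumed rather than taken from the abstract: a deterministic reward $r_h$, a transition kernel $P_h$, value functions $Q_h^\pi$ and $V_{h+1}^\pi$, and a risk parameter $\beta \neq 0$. Under these assumptions, exponentiating both sides of the risk-sensitive Bellman equation removes the logarithm, so the backup becomes a plain expectation of exponentiated values.

```latex
% Risk-sensitive Bellman equation for the entropic risk measure (assumed notation)
\[
Q_h^\pi(s,a) = r_h(s,a)
  + \frac{1}{\beta}\log \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}
    \left[e^{\beta V_{h+1}^\pi(s')}\right]
\]
% Multiply by \beta and exponentiate both sides
\[
e^{\beta Q_h^\pi(s,a)}
  = \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}
    \left[e^{\beta\left(r_h(s,a) + V_{h+1}^\pi(s')\right)}\right]
\]
```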

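This work studies risk-sensitive reinforcement learning based on the entropic risk measure. It introduces the exponential Bellman equation, which motivates a novel analysis of Bellman backup procedures and a novel exploration mechanism, and together these lead to improved regret upper bounds over existing ones.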