This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with the exponential-form objective. The risk-sensitive objective arises either as the agent's risk attitude or as a distributionally robust approach against model uncertainty. Owing to the martingale perspective in Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms developed for non-risk-sensitive scenarios to be adapted directly to incorporate risk sensitivity by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems because quadratic variation is nonlinear; q-learning, however, offers a solution and extends to infinite-horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves the finite-sample performance in the linear-quadratic control problem.
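
The quadratic variation penalty can be illustrated numerically. The following is a minimal sketch, assuming a value path sampled on a uniform time grid. The function names, the Euler-type discretization, and the sign convention for the risk-sensitivity coefficient `epsilon` are illustrative assumptions and are not taken from the paper's algorithm.

```python
import numpy as np


def realized_quadratic_variation(values):
    """Realized variance of a sampled value path: the sum of squared increments."""
    increments = np.diff(np.asarray(values, dtype=float))
    return float(np.sum(increments ** 2))


def penalized_increment_sum(values, rewards, q_values, dt, epsilon):
    """Hypothetical discretization of the penalized process.

    Sums, over the time grid, the terms
        dJ + (r - q) * dt + (epsilon / 2) * (dJ)^2,
    where dJ is the increment of the value path. With epsilon = 0 the
    last term vanishes and the non-risk-sensitive sum is recovered.
    `rewards` and `q_values` hold one entry per time step, so they are
    one element shorter than `values`.
    """
    d_j = np.diff(np.asarray(values, dtype=float))
    drift = (np.asarray(rewards, dtype=float) - np.asarray(q_values, dtype=float)) * dt
    return float(np.sum(d_j + drift + 0.5 * epsilon * d_j ** 2))


if __name__ == "__main__":
    path = [0.0, 1.0, 3.0]  # toy value path, not data from the paper
    print(realized_quadratic_variation(path))
    print(penalized_increment_sum(path, [0.0, 0.0], [0.0, 0.0], dt=0.1, epsilon=2.0))
```

In this sketch the only change relative to a non-risk-sensitive sum is the extra `0.5 * epsilon * d_j ** 2` term, which mirrors the idea of adding the realized variance of the value process to an existing algorithm.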

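This paper studies continuous-time risk-sensitive reinforcement learning under the entropy-regularized, exploratory diffusion process formulation, covering the risk-sensitive objective, the martingale perspective, and quadratic variation. Through this characterization, RL algorithms designed for non-risk-sensitive settings can be applied to risk-sensitive scenarios by adding the realized variance of the value process. The paper proves the convergence of the algorithm for Merton's investment problem and examines the effect of the temperature parameter on the behavior of the learning procedure. In addition, simulation experiments show that risk-sensitive RL improves finite-sample performance in the linear-quadratic control problem.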

Continuous-Time Risk-Sensitive Reinforcement Learning via Quadratic Variation Penalty