We study reinforcement learning (RL) in the setting of continuous time and
space, for an infinite horizon with a discounted objective and the underlying
dynamics driven by a stochastic differential equation. Building upon recent
advances in the continuous approach to RL, we develop a notion of occupation
time (specifically for a discounted objective), and show how it can be
effectively used to derive performance-difference and local-approximation
formulas. We further extend these results to illustrate their applications in
the PG (policy gradient) and TRPO/PPO (trust region policy optimization/
proximal policy optimization) methods, which have been familiar and powerful
tools in the discrete RL setting but under-developed in continuous RL. Through
numerical experiments, we demonstrate the effectiveness and advantages of our
approach.

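We study reinforcement learning in the setting of continuous time and space, introduce a notion of occupation time, and apply it to the policy gradient and TRPO/PPO methods. Numerical experiments confirm the effectiveness and advantages of the approach.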

Policy Optimization for Continuous Reinforcement Learning

The policy gradient theorem describes the gradient of the expected discounted
return with respect to an agent's policy parameters. However, most policy
gradient methods drop the discount factor from the state distribution and
therefore do not optimize the discounted objective. What do they optimize
instead? This has been an open question for several years, and this lack of
theoretical clarity has led to an abundance of misstatements in the
literature. We answer this question by proving that the update direction
approximated by most methods is not the gradient of any function. Further, we
argue that algorithms that follow this direction are not guaranteed to converge
to a "reasonable" fixed point by constructing a counterexample wherein the
fixed point is globally pessimal with respect to both the discounted and
undiscounted objectives. We motivate this work by surveying the literature and
showing that there remains a widespread misunderstanding regarding discounted
policy gradient methods, with errors present even in highly cited papers
published at top conferences.

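Even papers published at top conferences contain misstatements about the effect of dropping the discount factor from the state distribution. Most policy gradient methods that do so fail to optimize the discounted objective: the update direction they approximate is not the gradient of any function, so these algorithms are not guaranteed to converge to a reasonable fixed point.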
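
To make the distinction concrete, the following is a minimal sketch, not taken from the paper, of the per-step weights that multiply the score term grad log pi(a_t | s_t) in a sampled-trajectory estimator. It assumes the standard episodic form in which the policy gradient theorem weights step t by gamma^t * G_t, where G_t is the discounted return from step t, while the common variant keeps G_t but drops the gamma^t factor. The function names and the `keep_state_discount` flag are hypothetical and chosen only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = sum_{k >= t} gamma**(k - t) * r_k for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def score_weights(rewards, gamma, keep_state_discount):
    """Weights applied to grad log pi(a_t | s_t) at each step t.

    keep_state_discount=True  -> gamma**t * G_t (discounted state weighting)
    keep_state_discount=False -> G_t only (the gamma**t factor is dropped)
    """
    returns = discounted_returns(rewards, gamma)
    if keep_state_discount:
        return [gamma ** t * g for t, g in enumerate(returns)]
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(score_weights(rewards, 0.5, keep_state_discount=True))
    print(score_weights(rewards, 0.5, keep_state_discount=False))
```

The two weightings agree at t = 0 and diverge afterwards: with the gamma^t factor, later steps contribute progressively less, whereas the variant without it treats every visited state as equally important. This difference is the one the abstract refers to when it says the discount factor is dropped from the state distribution.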