We address the crucial yet underexplored stability properties of the
Hamilton--Jacobi--Bellman (HJB) equation in model-free reinforcement learning
contexts, specifically for Lipschitz continuous optimal control problems. We
bridge the gap between Lipschitz continuous optimal control problems and
classical optimal control problems in the viscosity solutions framework,
offering new insights into the stability of the value function of Lipschitz
continuous optimal control problems. By introducing structural assumptions on
the dynamics and reward functions, we further study the rate of convergence of
value functions. Moreover, we introduce a generalized framework for Lipschitz
continuous control problems that incorporates the original problem and leverage
it to propose a new HJB-based reinforcement learning algorithm. The stability
properties and performance of the proposed method are tested with well-known
benchmark examples in comparison with existing approaches.

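We study the stability properties of the Hamilton--Jacobi--Bellman equation in model-free reinforcement learning settings, in particular for Lipschitz continuous optimal control problems. By introducing structural assumptions on the dynamics and reward functions, we further study the rate of convergence of the value functions. In addition, we introduce a generalized framework for Lipschitz continuous control problems that contains the original problem, and use it to propose a new HJB-based reinforcement learning algorithm. The stability and performance of the proposed method are tested on well-known benchmark examples in comparison with existing approaches.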

Stability of Lipschitz continuous control problems and its application in reinforcement learning

On the stability of Lipschitz continuous control problems and its application to reinforcement learning

This paper deals with a class of neural SDEs and studies the limiting
behavior of the associated sampled optimal control problems as the sample size
grows to infinity. The neural SDEs with N samples can be linked to the
N-particle systems with centralized control. We analyze the
Hamilton--Jacobi--Bellman equation corresponding to the N-particle system and
establish regularity results which are uniform in N. The uniform regularity
estimates are obtained by the stochastic maximum principle and the analysis of
a backward stochastic Riccati equation. Using these uniform regularity results,
we show the convergence of the minima of objective functionals and optimal
parameters of the neural SDEs as the sample size N tends to infinity. The
limiting objects can be identified with suitable functions defined on the
Wasserstein space of Borel probability measures. Furthermore, quantitative
algebraic convergence rates are also obtained.

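This paper studies the limiting behavior of a class of neural stochastic differential equations. It derives the Hamilton--Jacobi--Bellman equation for the associated optimal control problems and obtains uniform regularity estimates by analyzing a backward stochastic Riccati equation. Using these estimates, it shows the convergence of the minima of the objective functionals and of the optimal parameters of the neural SDEs as the sample size tends to infinity. The limiting objects can be identified with suitable functions defined on the Wasserstein space of Borel probability measures, and quantitative algebraic convergence rates are obtained.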

Convergence analysis of controlled particle systems arising in deep learning: from finite samples to infinite sample size

Convergence analysis of controlled particle systems arising in deep learning: from finite to infinite sample size

The aim of this work is to develop deep learning-based algorithms for
high-dimensional stochastic control problems based on physics-informed learning
and dynamic programming. Unlike classical deep learning-based methods relying
on a probabilistic representation of the solution to the
Hamilton--Jacobi--Bellman (HJB) equation, we introduce a pathwise operator
associated with the HJB equation so that we can define a problem of
physics-informed learning. According to whether the optimal control has an
explicit representation, two numerical methods are proposed to solve the
physics-informed learning problem. We provide an error analysis on how the
truncation, approximation and optimization errors affect the accuracy of these
methods. Numerical results on various applications are presented to illustrate
the performance of the proposed algorithms.

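Building on physics-informed learning and dynamic programming, this work aims to develop deep learning-based algorithms for high-dimensional stochastic control problems. By introducing a pathwise operator associated with the Hamilton--Jacobi--Bellman equation, it defines a physics-informed learning problem and proposes two numerical methods to solve it. The work provides an error analysis of how truncation, approximation, and optimization errors affect the accuracy of these methods, and presents numerical results on various applications to illustrate the performance of the proposed algorithms.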
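
To make the idea of a physics-informed learning problem for an HJB equation more concrete, the sketch below evaluates a mean-squared HJB residual for a candidate value function. It is not the pathwise operator or either of the two numerical methods from the abstract above. It assumes a toy one-dimensional linear-quadratic problem (dynamics dx = u dt + sigma dW, running cost x^2 + u^2, zero terminal cost) chosen only because its value function is known in closed form. The names `SIGMA`, `T`, `exact_value`, `hjb_residual`, and `residual_loss` are hypothetical, and derivatives are approximated by central finite differences rather than by a neural network.

```python
import numpy as np

SIGMA = 0.5  # assumed noise level of the toy problem (illustrative)
T = 1.0      # assumed time horizon of the toy problem (illustrative)


def exact_value(t, x):
    """Closed-form value function of the toy problem:
    V(t, x) = tanh(T - t) * x^2 + sigma^2 * log(cosh(T - t))."""
    return np.tanh(T - t) * x**2 + SIGMA**2 * np.log(np.cosh(T - t))


def hjb_residual(V, t, x, h=1e-3):
    """HJB residual of a candidate V at points (t, x), using central differences.

    The toy HJB equation is
        V_t + min_u [u * V_x + u^2] + x^2 + (sigma^2 / 2) * V_xx = 0.
    The minimizer is u* = -V_x / 2, which gives the -V_x^2 / 4 term below.
    """
    V_t = (V(t + h, x) - V(t - h, x)) / (2 * h)
    V_x = (V(t, x + h) - V(t, x - h)) / (2 * h)
    V_xx = (V(t, x + h) - 2 * V(t, x) + V(t, x - h)) / h**2
    return V_t - V_x**2 / 4 + x**2 + 0.5 * SIGMA**2 * V_xx


def residual_loss(V, n=256, seed=0):
    """Mean-squared HJB residual over randomly sampled collocation points."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.1, 0.9, n)
    x = rng.uniform(-1.0, 1.0, n)
    return float(np.mean(hjb_residual(V, t, x) ** 2))


if __name__ == "__main__":
    # The exact solution should give a loss close to zero;
    # a wrong candidate such as V(t, x) = x^2 should not.
    print(residual_loss(exact_value))
    print(residual_loss(lambda t, x: x**2))
```

In a learning setting, the candidate V would be a parametrized function and a loss of this kind would be minimized over its parameters. The finite-difference version here only shows that the residual vanishes at the true value function and not at an incorrect one.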