Trajectory Optimization (TO) and Reinforcement Learning (RL) are powerful and
complementary tools to solve optimal control problems. On the one hand, TO can
efficiently compute locally-optimal solutions, but it tends to get stuck in
local minima if the problem is not convex. On the other hand, RL is typically
less sensitive to non-convexity, but it requires a much higher computational
effort. Recently, we have proposed CACTO (Continuous Actor-Critic with
Trajectory Optimization), an algorithm that uses TO to guide the exploration of
an actor-critic RL algorithm. In turn, the policy encoded by the actor is used
to warm-start TO, closing the loop between TO and RL. In this work, we present
an extension of CACTO, called CACTO-SL, that exploits the idea of Sobolev
learning. To make the training of the critic network faster and more
data-efficient, we enrich its training data with the gradient of the value
function, computed via a backward pass of the Differential Dynamic Programming
(DDP) algorithm. Our results show that the new
algorithm is more efficient than the original CACTO, reducing the number of TO
episodes by a factor ranging from 3 to 10, and consequently the computation
time. Moreover, we show that CACTO-SL helps TO to find better minima and to
produce more consistent results.
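
The Sobolev-learning idea described above, fitting the critic to both value targets and value-gradient targets, can be sketched as a loss function. The following is a minimal pure-Python illustration, not the authors' implementation: the names `sobolev_loss` and `make_quadratic_critic`, the weighting parameter `grad_weight`, and the quadratic critic standing in for a neural network are all assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def sobolev_loss(
    value_fn: Callable[[Vector], float],
    grad_fn: Callable[[Vector], List[float]],
    states: Sequence[Vector],
    target_values: Sequence[float],
    target_grads: Sequence[Vector],
    grad_weight: float = 1.0,
) -> float:
    """Mean squared value error plus weighted squared gradient error.

    `grad_weight` is a hypothetical trade-off parameter; setting it to 0
    recovers an ordinary value-only regression loss.
    """
    total = 0.0
    for x, v_target, g_target in zip(states, target_values, target_grads):
        value_err = value_fn(x) - v_target
        grad_err = sum((g - gt) ** 2 for g, gt in zip(grad_fn(x), g_target))
        total += value_err ** 2 + grad_weight * grad_err
    return total / len(states)


def make_quadratic_critic(w: float) -> Tuple[
    Callable[[Vector], float], Callable[[Vector], List[float]]
]:
    """Toy critic V(x) = 0.5 * w * ||x||^2 with its analytic gradient w * x."""

    def value(x: Vector) -> float:
        return 0.5 * w * sum(xi * xi for xi in x)

    def grad(x: Vector) -> List[float]:
        return [w * xi for xi in x]

    return value, grad


# Targets play the role of the values and value gradients that a TO solver
# would supply; here they come from a toy "true" critic with w = 2.
true_value, true_grad = make_quadratic_critic(2.0)
states = [[1.0, 0.0]]
target_values = [true_value(x) for x in states]
target_grads = [true_grad(x) for x in states]

model_value, model_grad = make_quadratic_critic(1.0)
loss = sobolev_loss(model_value, model_grad, states, target_values, target_grads)
```

In this toy setting the gradient term adds one extra scalar constraint per state dimension to each sample, which is the intuition behind the data-efficiency claim.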

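This work proposes CACTO-SL, an algorithm based on trajectory optimization and reinforcement learning that uses the gradient of the value function to speed up the training of the critic network. Experiments show that it is more efficient than the original CACTO: it reduces computation time, finds better minima, and produces more consistent results.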

CACTO-SL: Optimizing Continuous Actor-Critic and Trajectory Optimization with Sobolev Learning

CACTO-SL: Using Sobolev Learning to improve Continuous Actor-Critic with Trajectory Optimization

Optimal control (OC) algorithms such as Differential Dynamic Programming
(DDP) take advantage of the derivatives of the dynamics to efficiently control
physical systems. Yet, when the dynamics are non-smooth, this class of
algorithms is likely to fail, for instance because of discontinuities in the
dynamics derivatives or non-informative gradients. In contrast, reinforcement
learning (RL) algorithms have shown better empirical results in scenarios
exhibiting non-smooth effects (contacts, friction, etc.). Our approach
leverages recent work on randomized smoothing
(RS) to tackle non-smoothness issues commonly encountered in optimal control,
and provides key insights on the interplay between RL and OC through the prism
of RS methods. This naturally leads us to introduce the randomized Differential
Dynamic Programming (R-DDP) algorithm accounting for deterministic but
non-smooth dynamics in a very sample-efficient way. The experiments demonstrate
that our method is able to solve classic robotic problems with dry friction and
frictional contacts, where classical OC algorithms are likely to fail and RL
algorithms require in practice a prohibitive number of samples to find an
optimal solution.
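
To illustrate why randomized smoothing helps with non-informative gradients, the sketch below estimates the gradient of a Gaussian-smoothed version of a non-smooth function by Monte Carlo sampling. It is a generic zeroth-order estimator, not the R-DDP algorithm itself: the function names, the noise scale `eps`, the sample count, and the use of `sign` as a stand-in for a dry-friction-like term are assumptions made for this example.

```python
import math
import random
from typing import Callable


def sign(x: float) -> int:
    """Non-smooth test function: its derivative is zero wherever it exists."""
    return (x > 0) - (x < 0)


def smoothed_gradient(
    f: Callable[[float], float],
    x: float,
    eps: float = 0.5,
    num_samples: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of d/dx E[f(x + eps * z)], with z ~ N(0, 1).

    Uses the estimator E[(f(x + eps * z) - f(x)) * z] / eps; subtracting
    f(x) does not change the expectation but reduces the variance.
    """
    rng = random.Random(seed)
    fx = f(x)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += (f(x + eps * z) - fx) * z
    return total / (num_samples * eps)


# At the discontinuity the smoothed function has a clear positive slope,
# whose exact value for this example is sqrt(2 / pi) / eps.
grad_at_jump = smoothed_gradient(sign, 0.0)
exact_at_jump = math.sqrt(2.0 / math.pi) / 0.5

# Far from the discontinuity the perturbed samples never cross it,
# so the estimate is zero there.
grad_far_away = smoothed_gradient(sign, 5.0)
```

The noise scale `eps` controls the trade-off: larger values spread the gradient information over a wider region but bias the smoothed function further away from the original one.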

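This paper uses randomized smoothing to address the failure of optimal control algorithms on non-smooth dynamical systems, and handles deterministic but non-smooth dynamics effectively with the randomized Differential Dynamic Programming (R-DDP) algorithm. Experiments show that the method solves classic robotics problems that classical optimal control algorithms cannot solve and for which reinforcement learning algorithms require too many samples.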

Optimal Control of Non-Smooth Dynamical Systems Using Randomized Smoothing

Leveraging Randomized Smoothing for Optimal Control of Nonsmooth Dynamical Systems

The interpretation of Deep Neural Network (DNN) training as an optimal control
problem with nonlinear dynamical systems has received considerable attention
recently, yet the algorithmic development remains relatively limited. In this
work, we make an attempt along this line by reformulating the training
procedure from the trajectory optimization perspective. We first show that the
most widely used algorithms for training DNNs can be linked to Differential
Dynamic Programming (DDP), a celebrated second-order method rooted in
Approximate Dynamic Programming. In this vein, we propose a new class of
optimizers, the DDP Neural Optimizer (DDPNOpt), for training feedforward and
convolutional networks. DDPNOpt features layer-wise feedback policies which
improve convergence and reduce sensitivity to hyper-parameters compared with existing
methods. It outperforms other optimal-control-inspired training methods in both
convergence and complexity, and is competitive with state-of-the-art first-
and second-order methods. We also observe that DDPNOpt has a surprising benefit
in preventing vanishing gradients. Our work opens up new avenues for principled
algorithmic design built upon optimal control theory.
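
The notion of a layer-wise feedback policy can be illustrated with a toy example in which each layer is one step of a dynamical system, the layer weight plays the role of the control, and the weight update takes the standard DDP form of an open-loop term plus a feedback term on the state deviation. This is only a sketch under strong assumptions, not DDPNOpt itself: the scalar `tanh` layers, the function names, and the hand-picked `open_loop` and `gains` values are hypothetical, whereas an actual DDP backward pass would compute those terms from the loss.

```python
import math
from typing import List, Sequence


def forward(x0: float, weights: Sequence[float]) -> List[float]:
    """Propagate a scalar state through layers x_{t+1} = tanh(u_t * x_t)."""
    states = [x0]
    for u in weights:
        states.append(math.tanh(u * states[-1]))
    return states


def feedback_update(
    x0: float,
    weights: Sequence[float],
    open_loop: Sequence[float],
    gains: Sequence[float],
    step: float = 1.0,
) -> List[float]:
    """Apply u_t <- u_t + step * k_t + K_t * (x_t - x_t_nominal) layer by layer.

    x_t is the state reached with the already-updated earlier layers, so the
    feedback term corrects later layers for changes made to earlier ones.
    """
    nominal = forward(x0, weights)
    new_weights = []
    x = x0
    for t, u in enumerate(weights):
        dx = x - nominal[t]
        u_new = u + step * open_loop[t] + gains[t] * dx
        new_weights.append(u_new)
        x = math.tanh(u_new * x)
    return new_weights


weights = [0.5, -0.3]
open_loop = [0.1, 0.2]  # hypothetical open-loop terms k_t

# With zero gains, the update is a plain per-layer step along open_loop.
no_feedback = feedback_update(1.0, weights, open_loop, gains=[0.0, 0.0])

# With a nonzero gain, the second layer also reacts to the state deviation
# caused by the first layer's update.
with_feedback = feedback_update(1.0, weights, open_loop, gains=[0.0, 1.0])
```

With all gains set to zero and `open_loop` set to a negative gradient, the update above reduces to an ordinary gradient step per layer, which gives a rough picture of how standard training methods can be related to DDP.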

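This work reformulates the training of deep neural networks from a trajectory optimization perspective and proposes an optimizer based on Differential Dynamic Programming, the DDP Neural Optimizer (DDPNOpt). It features layer-wise feedback policies and improved convergence, shows a surprising advantage in avoiding vanishing gradients, and points to new directions for algorithm design based on optimal control theory.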