Learning to Optimize (L2O), a technique that utilizes machine learning to
learn an optimization algorithm automatically from data, has gained increasing
attention in recent years. A generic L2O approach parameterizes the iterative
update rule and learns the update direction as a black-box network. While the
generic approach is widely applicable, the learned model can overfit and may
not generalize well to out-of-distribution test sets. In this paper, we derive
the basic mathematical conditions that successful update rules commonly
satisfy. Consequently, we propose a novel L2O model with a mathematics-inspired
structure that is broadly applicable and generalizes well to
out-of-distribution problems. Numerical simulations validate our theoretical
findings and demonstrate the superior empirical performance of the proposed L2O
model.
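
As a toy illustration of learning part of an update rule from data, the sketch below fits a single scalar step size for the update x <- x - step * grad on a set of sampled least-squares problems. This is not the paper's model: a generic L2O approach would replace the scalar with a neural network trained by gradient descent, and the names `unrolled_loss`, `problems`, and `candidates`, the problem sizes, and the grid-search fit are all assumptions made for this example.

```python
import numpy as np

def unrolled_loss(step, problems, n_iters=20):
    """Average final objective after n_iters updates x <- x - step * grad."""
    total = 0.0
    for A, b in problems:
        x = np.zeros(A.shape[1])
        for _ in range(n_iters):
            grad = A.T @ (A @ x - b)      # gradient of 0.5 * ||Ax - b||^2
            x = x - step * grad           # parameterized update rule
        total += 0.5 * np.sum((A @ x - b) ** 2)
    return total / len(problems)

# Hypothetical training distribution: small random least-squares problems.
rng = np.random.default_rng(0)
problems = [(rng.standard_normal((10, 5)), rng.standard_normal(10))
            for _ in range(8)]

# "Learn" the step size by grid search on the unrolled training loss.
candidates = np.linspace(0.001, 0.1, 50)
losses = [unrolled_loss(s, problems) for s in candidates]
best_step = candidates[int(np.argmin(losses))]
```

A step size tuned this way is only as good as its training distribution: problems with a much larger curvature than those in `problems` could make the same step diverge, which is the out-of-distribution concern the abstract raises.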

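This paper proposes a mathematics-inspired L2O model; numerical simulations validate its theoretical findings and demonstrate its superiority over generic L2O models.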

Towards Constituting Mathematical Structures for Learning to Optimize

This paper proposes a new robust update rule for the target network in deep
reinforcement learning (DRL) to replace the conventional update rule, which is
an exponential moving average. The target network smoothly generates the
reference signals for a main network in DRL, thereby reducing learning
variance. The problem with the conventional update rule is that all parameters
are copied smoothly at the same speed from the main network, even when some of
them are being updated in the wrong direction. This behavior increases the
risk of generating wrong reference signals. Although slowing down the overall
update speed is a naive way to mitigate wrong updates, it would also slow
learning. To update the parameters robustly while maintaining learning speed,
a t-soft update method, inspired by the Student's t-distribution, is derived
from the analogy between the exponential moving average and the normal
distribution. Through analysis of the derived t-soft update, we show that it
inherits the properties of the Student's t-distribution. Specifically, owing
to the heavy-tailed property of the Student's t-distribution, the t-soft
update automatically excludes extreme updates that
differ from past experiences. In addition, when the updates are similar to the
past experiences, it can mitigate the learning delay by increasing the amount
of updates. In PyBullet robotics simulations for DRL, an online actor-critic
algorithm with the t-soft update outperformed the conventional methods in terms
of the obtained return and/or its variance. Analysis of the training process
showed that the t-soft update is globally consistent with the standard soft
update, while the update rates are locally adjusted for acceleration or
suppression.
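
The conventional soft update described above is an exponential moving average with a fixed rate, which can be written in a few lines. The second half of the sketch is a hypothetical simplification and not the rule derived in the paper: it rescales the rate with the standard Student's t weight (nu + 1) / (nu + d^2), only to show how a heavy-tailed weight can suppress extreme updates and speed up ordinary ones. The names `robust_soft_update`, `scale`, and `nu` are assumptions made for this example.

```python
import numpy as np

def soft_update(target, main, tau):
    """Conventional soft update: exponential moving average with rate tau."""
    return (1.0 - tau) * target + tau * main

def student_t_weight(sq_dev, nu):
    """Standard Student's t weight for a 1-D squared deviation.

    Equals (nu + 1) / nu at zero deviation and decays toward 0 as the
    deviation grows, so outliers receive little weight.
    """
    return (nu + 1.0) / (nu + sq_dev)

def robust_soft_update(target, main, tau, scale, nu=1.0):
    """Hypothetical illustration only (not the paper's t-soft update).

    Rescales tau by a Student's t weight computed from the mean squared
    difference between main and target parameters, measured in units of
    an assumed typical deviation `scale`.
    """
    sq_dev = np.mean((main - target) ** 2) / scale ** 2
    tau_eff = min(1.0, tau * student_t_weight(sq_dev, nu))
    return soft_update(target, main, tau_eff), tau_eff

target = np.zeros(3)
_, tau_small = robust_soft_update(target, 0.1 * np.ones(3), tau=0.1, scale=1.0)
_, tau_large = robust_soft_update(target, 100.0 * np.ones(3), tau=0.1, scale=1.0)
# tau_small exceeds 0.1 (accelerated); tau_large is far below 0.1 (suppressed).
```

In this simplification `scale` is fixed by hand; a practical method would have to estimate it from past updates, which is where a principled derivation matters.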

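This paper proposes a new robust update rule for the target network in deep reinforcement learning (DRL) to replace the conventional exponential moving average update rule. By analogy with the relationship between the exponential moving average and the normal distribution, it derives a t-soft update method based on the Student's t-distribution. In PyBullet robotics simulations, an online actor-critic algorithm with the t-soft update outperformed conventional methods in terms of the obtained return and/or its variance.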

t-Soft Update of Target Network for Deep Reinforcement Learning

Reinforcement learning (RL) algorithms update an agent's parameters according
to one of several possible rules, discovered manually through years of
research. Automating the discovery of update rules from data could lead to more
efficient algorithms, or algorithms that are better adapted to specific
environments. Although there have been prior attempts at addressing this
significant scientific challenge, it remains an open question whether it is
feasible to discover alternatives to fundamental concepts of RL such as value
functions and temporal-difference learning. This paper introduces a new
meta-learning approach that discovers an entire update rule which includes both
'what to predict' (e.g. value functions) and 'how to learn from it' (e.g.
bootstrapping) by interacting with a set of environments. The output of this
method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical
results show that our method discovers its own alternative to the concept of
value functions. Furthermore, it discovers a bootstrapping mechanism to maintain
and use its predictions. Surprisingly, when trained solely on toy environments,
LPG generalises effectively to complex Atari games and achieves non-trivial
performance. This shows the potential to discover general RL algorithms from
data.
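
For context, the hand-designed concepts that the abstract names, value functions and temporal-difference learning with bootstrapping, can be sketched with tabular TD(0) on a small random walk. This is the kind of manually discovered rule that a learned update rule would stand in for; it is not LPG itself. The environment, the function name `td0_random_walk`, and all hyperparameters are assumptions made for this example.

```python
import random

def td0_random_walk(n_states=5, episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) value estimation on a symmetric random walk.

    States 1..n_states are non-terminal; 0 and n_states + 1 are terminal.
    Reaching the right terminal state gives reward 1, everything else 0.
    """
    rng = random.Random(seed)
    values = [0.0] * (n_states + 2)        # terminal values stay at 0
    for _ in range(episodes):
        s = (n_states + 1) // 2            # start in the middle
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == n_states + 1 else 0.0
            # 'What to predict': the expected return from state s.
            # 'How to learn from it': bootstrap from the current
            # estimate at the next state instead of waiting for the return.
            target = reward + gamma * values[s_next]
            values[s] += alpha * (target - values[s])
            s = s_next
    return values[1:-1]

estimates = td0_random_walk()
```

With `gamma=1.0`, the true value of state i in this walk is i / (n_states + 1), so the estimates should rise from left to right, up to sampling noise.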

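This paper proposes a new meta-learning approach that, by interacting with a set of environments, discovers an entire update rule covering both what to predict (e.g. value functions) and how to learn from it (e.g. bootstrapping). The resulting RL algorithm, called LPG, discovers its own alternative to value functions and generalises effectively to complex Atari games.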