In this work we introduce the application of black-box quantum control as an
interesting reinforcement learning problem to the machine learning community.
We analyze the structure of the reinforcement learning problems arising in
quantum physics and argue that agents parameterized by long short-term memory
(LSTM) networks trained via stochastic policy gradients yield a general method
for solving them. In this context we introduce a variant of the proximal policy
optimization (PPO) algorithm, called memory proximal policy optimization
(MPPO), which is based on this analysis. We then show how it can be applied to
specific learning tasks and present results of numerical experiments showing
that our method achieves state-of-the-art results for several learning tasks in
quantum control with discrete and continuous control parameters.

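This paper introduces black-box quantum control to the machine learning community as an interesting reinforcement learning problem. It analyzes the structure of the reinforcement learning problems that arise in quantum physics and argues that agents parameterized by long short-term memory (LSTM) networks trained with stochastic policy gradients provide a general method for solving them. Based on this analysis, it introduces a variant of the proximal policy optimization (PPO) algorithm called memory proximal policy optimization (MPPO), shows how it can be applied to specific learning tasks, and presents numerical experiments indicating that the method achieves state-of-the-art results on several quantum control learning tasks with discrete and continuous control parameters.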

Computing gradients through experiments: black-box quantum control with LSTMs and memory proximal policy optimization

Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control
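
For readers unfamiliar with PPO, the following is a minimal sketch of the standard clipped surrogate objective that PPO maximizes. It is not the MPPO variant, whose details are not given in the abstract above. The function name, the toy batch, and the clipping value 0.2 are assumptions chosen for illustration.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, averaged over a batch (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy. All inputs are plain lists.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes the incentive to move the ratio
        # outside the interval [1 - clip_eps, 1 + clip_eps].
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Hypothetical toy batch of three (action, advantage) samples.
logp_old = [math.log(0.5), math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.5), math.log(0.25)]
advantages = [1.0, -1.0, 1.0]
objective = ppo_clipped_objective(logp_new, logp_old, advantages)
```

In the first sample the ratio 1.5 is clipped to 1.2 because the advantage is positive. In the third sample the ratio 0.5 is left unclipped, since the unclipped term is already the smaller of the two.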

We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by Expected Sarsa, EPG integrates across actions when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.

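The paper proposes expected policy gradients, which unify stochastic and deterministic policy gradients and estimate the gradient by integrating over actions. It proves that this reduces the variance of the gradient estimates. For Gaussian exploration, setting the covariance to the exponential of the Hessian with respect to the actions is better than standard exploration, and the method substantially outperforms deterministic policy gradients with the Ornstein-Uhlenbeck heuristic in four MuJoCo domains.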
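
The difference between sampling one action and integrating over all actions can be illustrated with a small sketch. It assumes a single state, a softmax policy over three discrete actions, and a known critic; the logits, the critic values, and all function names are hypothetical. The paper's setting (continuous actions, a learned critic, MuJoCo domains) is not reproduced here.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spg_sample(logits, q_values, rng):
    """One-sample score-function estimate: grad log pi(a) * Q(a), with a ~ pi."""
    pi = softmax(logits)
    a = rng.choices(range(len(pi)), weights=pi)[0]
    # For a softmax policy, d log pi(a) / d logit_k = 1[a == k] - pi_k.
    return [((1.0 if k == a else 0.0) - pi[k]) * q_values[a]
            for k in range(len(pi))]


def epg(logits, q_values):
    """Sum over all actions: sum_a pi(a) * grad log pi(a) * Q(a).

    For a softmax policy this reduces to pi_k * (Q_k - V), V = sum_a pi_a Q_a.
    """
    pi = softmax(logits)
    v = sum(p * q for p, q in zip(pi, q_values))
    return [pi[k] * (q_values[k] - v) for k in range(len(pi))]


rng = random.Random(0)
logits = [0.0, 0.5, -0.5]
q_values = [1.0, 2.0, 0.0]  # hypothetical critic values for the three actions

exact = epg(logits, q_values)
samples = [spg_sample(logits, q_values, rng) for _ in range(20000)]
n = len(samples)
spg_mean = [sum(s[k] for s in samples) / n for k in range(3)]
spg_var = [sum((s[k] - spg_mean[k]) ** 2 for s in samples) / n
           for k in range(3)]
```

In this toy setting the summed gradient is deterministic given the critic, while the single-sample estimate has the same mean but nonzero variance. This is the effect that the variance-reduction claim refers to.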