Standard stochastic control methods assume that the probability distribution
of uncertain variables is available. Unfortunately, in practice, obtaining
accurate distribution information is a challenging task. To resolve this issue,
we investigate the problem of designing a control policy that is robust against
errors in the empirical distribution obtained from data. This problem can be
formulated as a two-player zero-sum dynamic game problem, where the action
space of the adversarial player is a Wasserstein ball centered at the empirical
distribution. We propose computationally tractable value and policy iteration
algorithms with explicit estimates of the number of iterations required for
constructing an $\epsilon$-optimal policy. We show that the contraction
property of associated Bellman operators extends a single-stage out-of-sample
performance guarantee, obtained using a measure concentration inequality, to
the corresponding multi-stage guarantee without any degradation in the
confidence level. In addition, we characterize an explicit form of the optimal
distributionally robust control policy and the worst-case distribution policy
for linear-quadratic problems with Wasserstein penalty. Our study indicates
that dynamic programming and Kantorovich duality play a critical role in
solving and analyzing the Wasserstein distributionally robust stochastic
control problems.
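
As a rough illustration of why Kantorovich duality makes the adversary's problem tractable, the sketch below runs a distributionally robust value iteration on a toy finite problem. It is not the algorithm of this work. The finite state, action, and disturbance sets, the cost, the dynamics, and every function name are hypothetical, and the samples are assumed to lie in the disturbance support. The inner supremum over the Wasserstein ball of radius `theta` is replaced by its dual, $\min_{\lambda \ge 0} \lambda\theta + \frac{1}{N}\sum_i \max_w [f(w) - \lambda d(w, \hat{w}_i)]$, which is convex and piecewise linear in $\lambda$, so the code simply enumerates its kinks. The stopping rule is the generic bound for an $\alpha$-contraction and should not be read as the explicit iteration estimate derived by the authors.

```python
import itertools
import numpy as np

def worst_case_expectation(f, dist, theta):
    """Sup of E_Q[f] over the Wasserstein ball, computed through the dual.

    f:     (M,) values of the integrand on a finite disturbance support
    dist:  (N, M) transport cost from sample i to support point j
    theta: ball radius, assumed > 0
    """
    def dual(lam):
        return lam * theta + np.mean(np.max(f[None, :] - lam * dist, axis=1))

    # The dual is convex and piecewise linear in lam, so its minimum is at
    # lam = 0 or at a kink where two affine pieces of some sample cross.
    candidates = {0.0}
    for d_i in dist:
        for j, k in itertools.combinations(range(len(f)), 2):
            if d_i[j] != d_i[k]:
                lam = (f[j] - f[k]) / (d_i[j] - d_i[k])
                if lam > 0:
                    candidates.add(float(lam))
    return min(dual(lam) for lam in candidates)

def robust_value_iteration(cost, nxt, dist, theta, alpha, eps):
    """cost: (S, A) stage cost; nxt: (S, A, M) next-state index per disturbance."""
    n_states, n_actions = cost.shape
    v = np.zeros(n_states)
    iters = 0
    while True:
        q = np.array([[cost[s, a]
                       + alpha * worst_case_expectation(v[nxt[s, a]], dist, theta)
                       for a in range(n_actions)] for s in range(n_states)])
        v_new = q.min(axis=1)
        iters += 1
        # For an alpha-contraction, this gap implies ||v_new - v*||_inf <= eps.
        if np.max(np.abs(v_new - v)) <= eps * (1 - alpha) / alpha:
            return v_new, q.argmin(axis=1), iters
        v = v_new

# Toy problem; every number below is made up for illustration.
states = np.arange(5)
actions = np.array([-1, 0, 1])
support = np.array([-1, 0, 1])            # finite disturbance support
samples = np.array([0, 0, 1, 0, -1, 0])   # observed disturbances
dist = np.abs(samples[:, None] - support[None, :]).astype(float)
cost = (states[:, None] - 2.0) ** 2 + 0.1 * actions[None, :] ** 2
nxt = np.clip(states[:, None, None] + actions[None, :, None]
              + support[None, None, :], 0, 4)

v, policy, iters = robust_value_iteration(cost, nxt, dist,
                                          theta=0.2, alpha=0.9, eps=1e-3)
print(v, policy, iters)
```

Enumerating kinks costs on the order of $N M^2$ dual evaluations per backup, which is acceptable only for very small supports. A one-dimensional convex solver over $\lambda$ would be the natural replacement on larger problems.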

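This work studies a Wasserstein distributionally robust control problem and proposes computationally tractable value iteration and policy iteration algorithms. Using dynamic programming and Kantorovich duality, it establishes a multi-stage performance guarantee with no degradation in the confidence level and constructs an optimal distributionally robust control policy.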

Wasserstein Distributionally Robust Stochastic Control: A Data-Driven Approach

We present a new method of learning control policies that successfully
operate under unknown dynamic models. We create such policies by leveraging a
large number of training examples that are generated using a physical
simulator. Our system is made of two components: a Universal Policy (UP) and a
function for Online System Identification (OSI). We describe our control policy
as universal because it is trained over a wide array of dynamic models. These
variations in the dynamic model may include differences in mass and inertia of
the robots' components, variable friction coefficients, or unknown mass of an
object to be manipulated. By training the Universal Policy with this variation,
the control policy is prepared for a wider array of possible conditions when
executed in an unknown environment. The second part of our system uses the
recent state and action history of the system to predict the dynamics model
parameters mu. The value of mu from the Online System Identification is then
provided as input to the control policy (along with the system state).
Together, UP-OSI is a robust control policy that can be used across a wide
range of dynamic models, and that is also responsive to sudden changes in the
environment. We have evaluated the performance of this system on a variety of
tasks, including the problem of cart-pole swing-up, the double inverted
pendulum, locomotion of a hopper, and block-throwing of a manipulator. UP-OSI
is effective at these tasks across a wide range of dynamic models. Moreover,
when tested with dynamic models outside of the training range, UP-OSI
outperforms the Universal Policy alone, even when UP is given the actual value
of the model dynamics. Beyond producing more robust controllers, UP-OSI
also holds promise for narrowing the Reality Gap between simulated and real
physical systems.
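
The data flow described above (recent history into OSI, the resulting mu together with the state into UP) can be sketched on a toy plant. Everything in this block is an assumption made for illustration: the plant is a point mass whose mass plays the role of mu, `universal_policy` is a closed-form controller rather than a trained policy, and `online_system_id` is a least-squares fit rather than a learned predictor. None of the names, gains, or constants come from the work itself.

```python
from collections import deque
import numpy as np

DT = 0.05  # integration step (hypothetical)

def step(vel, force, mass):
    """Toy plant: a point mass pushed by a force; the mass is the hidden mu."""
    return vel + DT * force / mass

def universal_policy(vel, mu, target, gain=4.0):
    """Stand-in for UP: maps (state, mu) to an action.

    Scaling the force by mu gives the same closed-loop response for any mass.
    """
    return mu * gain * (target - vel)

def online_system_id(history, default_mu=1.0):
    """Stand-in for OSI: fit force = mu * accel over the recent history."""
    if not history:
        return default_mu
    vel, force, nxt = (np.array(col) for col in zip(*history))
    accel = (nxt - vel) / DT
    denom = float(accel @ accel)
    if denom < 1e-12:          # no excitation yet, nothing to identify
        return default_mu
    return float(force @ accel) / denom

def run(true_mass, steps=200, window=10):
    """Track a sinusoidal velocity target; the mass triples halfway through."""
    history = deque(maxlen=window)   # recent (state, action, next state)
    vel, estimates = 0.0, []
    for t in range(steps):
        if t == steps // 2:
            true_mass *= 3.0         # sudden change in the environment
        mu = online_system_id(history)
        force = universal_policy(vel, mu, target=np.sin(0.1 * t))
        nxt = step(vel, force, true_mass)
        history.append((vel, force, nxt))
        vel = nxt
        estimates.append(mu)
    return estimates

estimates = run(true_mass=2.0)
print(estimates[99], estimates[-1])
```

The fixed-length window is what lets the estimate recover after the mass changes: once the window holds only post-change samples, the fit no longer mixes the two regimes. The moving target is there to keep the system excited, since a plant at rest gives the estimator nothing to fit.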

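Using a large number of training examples generated by a physical simulator, we propose a new method of learning control policies that operate successfully under unknown dynamic models. Our system has two parts, a Universal Policy (UP) and an Online System Identification (OSI) function. The trained UP takes the system state together with the value of mu predicted by OSI as input, and the resulting UP-OSI is a robust control policy that can be used across a wide range of dynamic models.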