In this paper we introduce a new reinforcement learning method for
control problems in environments with delayed feedback. Specifically, our
method employs stochastic planning, whereas previous methods used
deterministic planning. This allows us to embed risk preference in the policy
optimization problem. We show that this formulation can recover the optimal
policy for problems with deterministic transitions. We contrast our policy with
two prior methods from the literature. We apply the methodology to simple tasks
to understand its features. Then, we compare the performance of the methods on
multiple Atari games.
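
To make the delayed-feedback setting concrete, the following is a minimal sketch of an environment in which each chosen action takes effect only after a fixed number of steps. It is not the method of the paper: the toy environment `ChainEnv`, the wrapper `ActionDelayWrapper`, the helper `run_delayed`, and the constant action delay are all assumptions made for illustration.

```python
from collections import deque


class ChainEnv:
    """Hypothetical toy environment: positions 0..4, reward at position 4."""

    def __init__(self):
        self.pos = 0

    def step(self, action):
        # action is -1 (left), 0 (stay), or +1 (right); position is clipped.
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos == 4 else 0.0
        return self.pos, reward


class ActionDelayWrapper:
    """Executes each submitted action `delay` steps after it was chosen."""

    def __init__(self, env, delay, default_action=0):
        self.env = env
        # Pre-fill the queue so the first `delay` steps run the default action.
        self.pending = deque([default_action] * delay)

    def step(self, action):
        self.pending.append(action)
        executed = self.pending.popleft()
        obs, reward = self.env.step(executed)
        return obs, reward, executed


def run_delayed(delay, actions):
    """Feed `actions` to a freshly wrapped environment and collect the results."""
    env = ActionDelayWrapper(ChainEnv(), delay)
    return [env.step(a) for a in actions]
```

With `delay=2`, the agent that always chooses `+1` still sees the position stay at 0 for the first two steps, which is why a controller in this setting has to plan over actions that are already in flight.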

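This paper introduces a new reinforcement learning method for control problems in environments with delayed feedback. The method uses stochastic planning rather than the deterministic planning used previously, which embeds risk preference in the policy optimization problem. We show that the method recovers the optimal policy for problems with deterministic transitions and contrast it with two prior methods from the literature. We apply the method to simple tasks to understand its features, then compare the performance of the methods on multiple Atari games.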

Control in Stochastic Environment with Delays: A Model-based Reinforcement Learning Approach

Stochastic planning can be reduced to probabilistic inference in large
discrete graphical models, but the hardness of inference requires
approximation schemes. In this paper we argue that such applications can be
disentangled along two dimensions. The first is the direction of information
flow in the idealized exact optimization objective, i.e., forward vs. backward
inference. The second is the type of approximation used to compute this
objective, e.g., Belief Propagation (BP) vs. mean field variational inference
(MFVI). This new categorization allows us to unify a large number of isolated
efforts in prior work, explaining their connections and differences as well as
potential improvements. An extensive experimental evaluation over large
stochastic planning problems shows the advantage of forward BP over several
algorithms based on MFVI. An analysis of the practical limitations of MFVI
motivates a novel algorithm, collapsed state variational inference (CSVI),
which provides a tighter approximation and achieves planning performance
comparable to forward BP.
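
As a rough illustration of the forward direction of inference, the sketch below propagates a state marginal forward through a tiny Markov model under a candidate action sequence and scores the sequence by its expected cumulative reward. It is not forward BP, MFVI, or CSVI: with a single state variable the forward pass is exact, whereas the large models the abstract refers to require approximating these marginals. The transition table `P`, the vector `REWARD`, and the functions `forward_value` and `best_plan` are hypothetical, with numbers made up for illustration.

```python
from itertools import product

# Hypothetical 2-state, 2-action problem; all numbers are illustrative.
# P[a][s][s2] = probability of moving from state s to s2 under action a.
P = {
    0: [[0.9, 0.1], [0.2, 0.8]],
    1: [[0.3, 0.7], [0.6, 0.4]],
}
REWARD = [0.0, 1.0]  # reward collected for being in each state


def forward_value(initial, actions):
    """Propagate the state marginal forward and sum the expected rewards."""
    belief = list(initial)
    total = 0.0
    for a in actions:
        belief = [
            sum(belief[s] * P[a][s][s2] for s in range(2)) for s2 in range(2)
        ]
        total += sum(b * r for b, r in zip(belief, REWARD))
    return total


def best_plan(initial, horizon):
    """Enumerate open-loop action sequences and keep the best forward value."""
    return max(
        product((0, 1), repeat=horizon),
        key=lambda seq: forward_value(initial, seq),
    )
```

Enumerating action sequences is only feasible for toy horizons; it stands in here for the optimization step so that the forward flow of information stays visible.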

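The paper disentangles stochastic planning along two dimensions, forward versus backward inference and the type of approximation used (such as belief propagation or mean field variational inference), and then proposes the collapsed state variational inference (CSVI) algorithm. Experimental comparison finds that CSVI and forward belief propagation are among the best stochastic planning methods.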