We present a unified framework for learning continuous control policies using
backpropagation. It supports stochastic control by treating stochasticity in
the Bellman equation as a deterministic function of exogenous noise. The
product is a spectrum of general policy gradient algorithms that range from
model-free methods with value functions to model-based methods without value
functions. We use learned models but only require observations from the
environment instead of observations from model-predicted trajectories,
minimizing the impact of compounded model errors. We apply these algorithms
first to a toy stochastic control problem and then to several physics-based
control problems in simulation. One of these variants, SVG(1), shows the
effectiveness of learning models, value functions, and policies simultaneously
in continuous domains.

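This paper presents a unified framework for learning continuous control policies with backpropagation, supporting stochastic control by treating the stochasticity in the Bellman equation as a deterministic function of exogenous noise. The result is a spectrum of general policy gradient algorithms, ranging from model-free methods with value functions to model-based methods without value functions. We use learned models but require only observations from the environment rather than observations from model-predicted trajectories, which minimizes the impact of compounded model errors. We first apply these algorithms to a toy stochastic control problem and then to several physics-based control problems in simulation. One variant, SVG(1), demonstrates the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.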
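
The idea of treating stochasticity as a deterministic function of exogenous noise can be illustrated with a minimal sketch. It is not the SVG algorithm itself: the quadratic reward, the Gaussian policy, and the function name `pathwise_gradient` are assumptions chosen only for illustration. The action is written as a = mu + sigma * eps with eps drawn independently of the policy parameter mu, so the gradient of the expected reward can be estimated by differentiating through each sampled action.

```python
import random


def pathwise_gradient(mu, sigma, n_samples=10000, seed=0):
    """Estimate d/dmu E[r(a)] for a = mu + sigma * eps, eps ~ N(0, 1).

    Toy reward (an assumption for this sketch): r(a) = -(a - 3)^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # exogenous noise, independent of mu
        a = mu + sigma * eps        # action as a deterministic function of (mu, eps)
        dr_da = -2.0 * (a - 3.0)    # derivative of the toy reward at the sampled action
        da_dmu = 1.0                # derivative of the action with respect to mu
        total += dr_da * da_dmu     # chain rule through the sampled path
    return total / n_samples


grad = pathwise_gradient(mu=1.0, sigma=0.5)
```

For this toy reward the exact gradient is -2 * (mu - 3), which is 4 at mu = 1, so the Monte Carlo estimate should land close to that value.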

Learning Continuous Control Policies by Stochastic Value Gradients

Managing risk in dynamic decision problems is of cardinal importance in many
fields such as finance and process control. The most common approach to
defining risk is through various variance-related criteria such as the Sharpe
Ratio or the standard-deviation-adjusted reward. It is known that optimizing
many of the variance-related risk criteria is NP-hard. In this paper we devise
a framework for local policy-gradient-style algorithms for reinforcement
learning for variance-related criteria. Our starting point is a new formula for
the variance of the cost-to-go in episodic tasks. Using this formula we develop
policy gradient algorithms for criteria that involve both the expected cost and
the variance of the cost. We prove the convergence of these algorithms to local
minima and demonstrate their applicability in a portfolio planning problem.

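This paper proposes a new formula for the variance of the cost-to-go in episodic tasks and uses it to build a risk-management framework based on local policy gradient algorithms. It studies criteria that involve both the expected cost and the variance of the cost, and applies the resulting algorithms to a portfolio planning problem.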
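
One standard way to obtain the variance of the cost-to-go in an episodic task is through its first and second moments. The sketch below is not necessarily the formula the paper derives. It assumes a fixed policy, deterministic per-step costs, and a hypothetical helper `cost_to_go_moments`. Writing the total cost as B = c(x) + B', where B' is the cost accumulated from the next state, gives E[B^2] = c(x)^2 + 2 c(x) E[B'] + E[B'^2], and the variance is E[B^2] - E[B]^2.

```python
def cost_to_go_moments(costs, transitions, n_iters=200):
    """Iterate the moment recursions for an episodic Markov chain.

    costs[x]       -- deterministic cost paid on each visit to state x
    transitions[x] -- dict mapping non-terminal next states to probabilities;
                      any remaining probability mass leads to termination
    Returns (J, variance): expected cost-to-go and its variance per state.
    """
    J = {x: 0.0 for x in costs}   # first moment  E[B | x]
    M = {x: 0.0 for x in costs}   # second moment E[B^2 | x]
    for _ in range(n_iters):
        new_J, new_M = {}, {}
        for x, c in costs.items():
            exp_J = sum(p * J[y] for y, p in transitions[x].items())
            exp_M = sum(p * M[y] for y, p in transitions[x].items())
            new_J[x] = c + exp_J
            new_M[x] = c * c + 2.0 * c * exp_J + exp_M
        J, M = new_J, new_M
    variance = {x: M[x] - J[x] ** 2 for x in costs}
    return J, variance


# Hypothetical one-state example: pay cost 1 per step, terminate with probability 0.5.
J, variance = cost_to_go_moments({"x": 1.0}, {"x": {"x": 0.5}})
```

In this example the total cost is the number of steps until termination, a geometric random variable with success probability 0.5, so both its mean and its variance equal 2.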