We adapt the ideas underlying the success of Deep Q-Learning to the
continuous action domain. We present an actor-critic, model-free algorithm
based on the deterministic policy gradient that can operate over continuous
action spaces. Using the same learning algorithm, network architecture and
hyper-parameters, our algorithm robustly solves more than 20 simulated physics
tasks, including classic problems such as cartpole swing-up, dexterous
manipulation, legged locomotion and car driving. Our algorithm is able to find
policies whose performance is competitive with those found by a planning
algorithm with full access to the dynamics of the domain and its derivatives.
We further demonstrate that for many of the tasks the algorithm can learn
policies end-to-end: directly from raw pixel inputs.

本论文将 Deep Q-Learning 算法应用于连续动作域，并提出了一种基于确定性策略梯度的演员 - 评论家模型无模型算法，可在连续动作空间中进行操作，成功解决了 20 多个模拟物理任务，并能与完全访问动态并了解其导数的规划算法相竞争，并证明该算法对许多任务能够进行端到端学习。