Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and brittleness to hyperparameters. Both of these challenges limit the applicability of such methods to real-world domains. In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework. In this framework, the actor aims to simultaneously maximize expected return and entropy. That is, to succeed at the task while acting as randomly as possible. We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter. We systematically evaluate SAC on a range of benchmark tasks, as well as real-world challenging tasks such as locomotion for a quadrupedal robot and robotic manipulation with a dexterous hand. With these improvements, SAC achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving similar performance across different random seeds. These results suggest that SAC is a promising candidate for learning in real-world robotics tasks.

本文介绍了一种基于最大熵强化学习框架的离线演员-评论家算法 Soft Actor-Critic，其中演员旨在同时最大化期望回报和熵，以在任务中成功执行尽可能随机的动作。作者通过对其进行一系列改进，如约束模型等，提高了模型的稳定性和训练速度，并在基准任务以及四足机器人的运动和灵巧手的机器人操作等现实世界挑战任务中取得了最先进的性能，在样本效率和渐近性能方面优于以往的在线和离线算法。

软性演员-评论家算法及其应用