Off-policy actor-critic algorithms have shown promise in deep reinforcement
learning for continuous control tasks. Their success largely stems from
leveraging pessimistic state-action value function updates, which effectively
address function approximation errors and improve performance. However, such
pessimism can lead to under-exploration, constraining the agent's ability to
explore and refine its policies. Conversely, optimism can counteract
under-exploration, but it can also lead to excessive risk-taking and poor
convergence if not properly balanced. Based on these insights, we
introduce Utility Soft Actor-Critic (USAC), a novel framework within the
actor-critic paradigm that enables independent control over the degree of
pessimism/optimism for both the actor and the critic via interpretable
parameters. USAC adapts its exploration strategy to the uncertainty of its
critics through a utility function that lets us set the balance between
pessimism and optimism separately for the actor and the critic. By going beyond
a binary choice between optimism and pessimism, USAC represents a significant
step towards balancing the two within off-policy actor-critic algorithms. Our
experiments across various continuous control problems show that the most
effective degree of pessimism or optimism depends on the nature of the task.
Furthermore, we demonstrate that USAC can outperform state-of-the-art
algorithms when its pessimism/optimism parameters are appropriately configured.

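By leveraging pessimistic state-action value function updates and independently controlling the degree of pessimism/optimism through interpretable parameters, Utility Soft Actor-Critic (USAC) achieves balance within off-policy actor-critic algorithms and, depending on the nature of the task, can outperform existing algorithms when its pessimism/optimism parameters are appropriately configured.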
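To make the idea of separate, interpretable pessimism/optimism parameters concrete, the following is a minimal sketch and not the method defined in the paper. It assumes a simple mean-plus-scaled-standard-deviation utility over an ensemble of critic estimates; the function name `utility` and the parameters `beta`, `beta_critic`, and `beta_actor` are hypothetical names chosen for illustration.

```python
import numpy as np


def utility(q_values, beta):
    """Collapse an ensemble of critic estimates into one value per sample.

    ASSUMPTION: a mean + beta * std form is used purely for illustration;
    the utility function actually used by USAC may differ.

    q_values: array of shape (n_critics, batch_size)
    beta:     beta < 0 is pessimistic, beta > 0 is optimistic, 0 is neutral
    """
    q_values = np.asarray(q_values, dtype=float)
    mean = q_values.mean(axis=0)  # ensemble mean per sample
    std = q_values.std(axis=0)    # population std as the uncertainty proxy
    return mean + beta * std


# Two critics, batch of two state-action pairs (made-up numbers).
q = np.array([[1.0, 2.0],
              [3.0, 2.0]])

# Hypothetical, independently chosen settings for the critic and the actor.
beta_critic = -1.0  # pessimistic value targets
beta_actor = 0.5    # mildly optimistic policy objective

critic_target = utility(q, beta_critic)
actor_objective = utility(q, beta_actor)
```

Under these assumptions, where the two critics disagree the pessimistic critic target is pulled below the ensemble mean and the optimistic actor objective is pushed above it, while both coincide with the mean where the critics agree. With exactly two critics and the population standard deviation, `beta = -1` reduces to taking the minimum of the two estimates.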