Policy entropy regularization is commonly used to improve exploration in deep reinforcement learning (RL). However, it is sample-inefficient in off-policy learning because it does not take into account the distribution of previous samples stored in the replay buffer. To exploit this sample distribution for sample-efficient exploration, we propose sample-aware entropy regularization, which maximizes the entropy of a weighted sum of the policy action distribution and the sample action distribution from the replay buffer. We formulate the problem of sample-aware entropy regularized policy iteration, prove its convergence, and provide a practical algorithm named diversity actor-critic (DAC), which is a generalization of soft actor-critic (SAC). Numerical results show that DAC outperforms SAC and other state-of-the-art RL algorithms.
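
The regularizer described above can be illustrated with a minimal sketch for a single state with a small discrete action set. The weight name `alpha`, the discrete setting, and the assumption that the buffer's action distribution is known exactly are all illustrative choices, not details given in the abstract, which does not say how the mixture is estimated in practice.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mixture(policy, buffer_dist, alpha):
    """Weighted sum alpha * policy + (1 - alpha) * buffer_dist (alpha is an assumed name)."""
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(policy, buffer_dist)]

def sample_aware_entropy(policy, buffer_dist, alpha):
    """Entropy of the mixture of policy and replay-buffer action distributions."""
    return entropy(mixture(policy, buffer_dist, alpha))

# Hypothetical example: the buffer holds only actions 0 and 1 for this state.
buffer_dist = [0.5, 0.5, 0.0, 0.0]
uniform_policy = [0.25, 0.25, 0.25, 0.25]  # maximizes plain policy entropy
novel_policy = [0.0, 0.0, 0.5, 0.5]        # covers actions the buffer lacks

for name, pi in [("uniform", uniform_policy), ("novel", novel_policy)]:
    print(name, entropy(pi), sample_aware_entropy(pi, buffer_dist, alpha=0.5))
```

In this toy case the uniform policy has the higher plain entropy, while the policy concentrated on the actions missing from the buffer has the higher mixture entropy. Setting `alpha` to 1 reduces the mixture entropy to the ordinary policy entropy, which is consistent with the abstract's description of DAC as a generalization of SAC.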

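This work proposes a sample-aware policy entropy regularization method to improve the exploration performance of conventional policy entropy regularization. By using the sample distribution available in the replay buffer, it achieves sample-efficient exploration by maximizing the entropy of the weighted sum of the policy action distribution and the sample action distribution in the buffer. Building on this regularization method, the authors develop a practical algorithm named diversity actor-critic (DAC), and numerical experiments show significant performance gains in reinforcement learning applications.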

Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration