\textit{MinMaxMin} $Q$-learning is a novel \textit{optimistic} Actor-Critic
algorithm that addresses the \textit{overestimation} bias (the
$Q$-estimates exceed the true $Q$-values) inherent in
\textit{conservative} RL algorithms. Its core formula relies on the
disagreement among $Q$-networks, measured as the min-batch MaxMin
$Q$-networks distance, which is added to the $Q$-target and used as the
prioritized experience replay sampling rule. We implement \textit{MinMaxMin}
on top of TD3 and TD7 and test it against state-of-the-art
continuous-space algorithms (DDPG, TD3, and TD7) across popular MuJoCo and Bullet
environments. The results show a consistent performance improvement of
\textit{MinMaxMin} over DDPG, TD3, and TD7 across all tested tasks.
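
The abstract does not spell out the formula, so the following is only a minimal sketch of one plausible reading. It assumes that the MaxMin distance of a sample is the largest minus the smallest estimate across the $Q$-networks, that "min-batch" means the minimum of that distance over the batch, that the resulting scalar is added to a TD3-style clipped target, and that the per-sample distance serves as the replay priority. All function names and the `eps` constant are hypothetical.

```python
import numpy as np

def maxmin_distance(q_values):
    """Per-sample disagreement: max over networks minus min over networks.

    q_values has shape (n_networks, batch_size).
    """
    q_values = np.asarray(q_values, dtype=float)
    return q_values.max(axis=0) - q_values.min(axis=0)

def minmaxmin_bonus(q_values):
    """Minimum over the batch of the per-sample MaxMin distance (assumed reading)."""
    return float(maxmin_distance(q_values).min())

def optimistic_target(rewards, not_done, next_q_values, gamma=0.99):
    """Clipped (minimum over networks) target plus the batch-level bonus.

    Where exactly the bonus enters the target is an assumption of this sketch.
    """
    next_q_values = np.asarray(next_q_values, dtype=float)
    conservative = next_q_values.min(axis=0)
    bonus = minmaxmin_bonus(next_q_values)
    return np.asarray(rewards) + gamma * np.asarray(not_done) * conservative + bonus

def sampling_probabilities(q_values, eps=1e-6):
    """Replay probabilities proportional to per-sample disagreement (assumption)."""
    priorities = maxmin_distance(q_values) + eps  # eps keeps every sample reachable
    return priorities / priorities.sum()
```

Under these assumptions the bonus is never negative, so it can only raise the clipped target, and it vanishes whenever the networks agree on at least one sample in the batch.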

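MinMaxMin is an optimistic Actor-Critic algorithm that uses prioritized experience replay to address the overestimation bias present in conservative reinforcement learning algorithms. Experiments show that MinMaxMin consistently improves performance over DDPG, TD3, and TD7 on all tested tasks.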

\textit{MinMaxMin} $Q$-learning

Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement
learning (DRL) algorithm based on maximum entropy reinforcement learning. By
combining off-policy updates with an actor-critic formulation, SAC achieves
state-of-the-art performance on a range of continuous-action benchmark tasks,
outperforming prior on-policy and off-policy methods. The off-policy method
employed by SAC samples data uniformly from past experience when performing
parameter updates. We propose Emphasizing Recent Experience (ERE), a simple but
powerful off-policy sampling technique, which emphasizes recently observed data
while not forgetting the past. The ERE algorithm samples more aggressively from
recent experience, and also orders the updates to ensure that updates from old
data do not overwrite updates from new data. We compare vanilla SAC and
SAC+ERE, and show that SAC+ERE is more sample efficient than vanilla SAC on
continuous-action MuJoCo tasks. We also consider combining SAC with Prioritized
Experience Replay (PER), a scheme originally proposed for deep Q-learning that
prioritizes the data based on temporal-difference (TD) error. We show that
SAC+PER can marginally improve the sample efficiency of SAC, but
much less so than SAC+ERE. Finally, we propose an algorithm which integrates
ERE and PER and show that this hybrid algorithm can give the best results for
some of the MuJoCo tasks.
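
The sampling rule described above can be sketched as follows. The abstract gives no formula, so this sketch assumes a geometric schedule in which the $k$-th of $K$ consecutive updates samples uniformly from the $c_k = \max(N \eta^{1000k/K}, c_{\min})$ most recent transitions of a buffer holding $N$ transitions. The function names, the default values of `eta` and `c_min`, and the convention that the highest index is the newest transition are all assumptions made for illustration.

```python
import numpy as np

def ere_range(k, num_updates, buffer_size, eta=0.996, c_min=5000):
    """Number of most-recent transitions eligible for the k-th update (1-indexed)."""
    c_k = buffer_size * eta ** (k * 1000.0 / num_updates)
    # Never shrink below c_min, and never exceed what the buffer holds.
    return int(min(buffer_size, max(c_k, c_min)))

def ere_sample_indices(k, num_updates, buffer_size, batch_size, rng,
                       eta=0.996, c_min=5000):
    """Uniformly sample indices from the c_k most recent transitions.

    Index buffer_size - 1 is assumed to be the newest transition.
    """
    c_k = ere_range(k, num_updates, buffer_size, eta, c_min)
    return rng.integers(buffer_size - c_k, buffer_size, size=batch_size)

# Example: 1000 updates over a buffer of 100000 transitions.
rng = np.random.default_rng(0)
first_batch = ere_sample_indices(1, 1000, 100_000, 256, rng)
last_batch = ere_sample_indices(1000, 1000, 100_000, 256, rng)
```

Because the eligible range shrinks as $k$ grows, early updates draw on nearly the whole buffer while the final updates use only recent data, which is one way to keep updates from old data from overwriting updates from new data. Setting `eta=1.0` recovers uniform sampling over the full buffer.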

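Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement learning algorithm based on maximum entropy reinforcement learning, combining off-policy updates with an actor-critic formulation. Experiments show that the off-policy sampling technique Emphasizing Recent Experience (ERE) further improves the sample efficiency of SAC, and that a hybrid of ERE and Prioritized Experience Replay (PER) gives the best results on some tasks.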