A significant amount of research in recent years has been dedicated to single-agent deep reinforcement learning. Much of the success of deep reinforcement learning can be attributed to the use of experience replay memories, within which state transitions are stored. Function approximators such as convolutional neural networks (referred to as deep Q-Networks, or DQNs, in this context) can subsequently be trained by sampling the stored transitions. However, care is required when using experience replay memories within multi-agent systems, as stored transitions can become outdated when agents update their respective policies in parallel [1]. In this work we apply leniency [2] to multi-agent deep reinforcement learning (MA-DRL), where it acts as a control mechanism that determines which sampled state transitions are allowed to update the DQN. Our resulting Lenient-DQN (LDQN) is evaluated using variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) outlined by Busoniu et al. [3]. The LDQN significantly outperforms the existing hysteretic DQN (HDQN) [4] within environments that yield stochastic rewards. Based on results from experiments conducted using vanilla and double Q-learning versions of the lenient and hysteretic algorithms, we advocate a hybrid approach in which learners initially use vanilla Q-learning before transitioning to double Q-learning upon converging on a cooperative joint policy.

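This work applies leniency to multi-agent deep reinforcement learning. By controlling the temperature values used to attenuate negative policy updates, it introduces optimism into value-function updates, which promotes cooperation and makes convergence to the optimal policy likely over the course of long-horizon planning. Empirical evaluation shows that LDQN agents are more likely than standard and scheduled HDQN agents to converge to the optimal policy in tasks with stochastic rewards.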
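To make the role of leniency as a filter on updates concrete, the following is a minimal sketch rather than the method as specified in this work. It assumes a leniency value of the form 1 - exp(-k * T) computed from a temperature T, a multiplicative temperature decay, and a rule that always accepts positive temporal-difference errors while accepting negative ones only with probability 1 - leniency. The names `leniency`, `accept_update`, `decay_temperature`, and the parameters `k` and `beta` are hypothetical and chosen for illustration.

```python
import math
import random
from typing import Callable


def leniency(temperature: float, k: float = 2.0) -> float:
    """Assumed leniency function: a higher temperature yields more leniency."""
    return 1.0 - math.exp(-k * temperature)


def accept_update(
    td_error: float,
    temperature: float,
    k: float = 2.0,
    uniform: Callable[[], float] = random.random,
) -> bool:
    """Decide whether a sampled transition may update the value estimate.

    Positive TD errors always pass. Negative (or zero) TD errors pass only
    when a uniform draw exceeds the current leniency, so they are mostly
    ignored while the temperature is high.
    """
    if td_error > 0:
        return True
    return uniform() > leniency(temperature, k)


def decay_temperature(temperature: float, beta: float = 0.995) -> float:
    """Assumed multiplicative decay, applied each time a state-action pair is visited."""
    return beta * temperature
```

Under these assumptions, a learner starts out optimistic (high temperature, negative updates mostly discarded) and gradually behaves like an ordinary Q-learner as the temperature decays toward zero.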

Lenient Multi-Agent Deep Reinforcement Learning