Reinforcement learning (RL) is a classical tool to solve network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm which adapts the classical Q-learning is proposed to handle these challenges for networks which admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run on multiple, distinct, synthetically created and structurally related Markovian environments in parallel; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions are provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% less average policy error with up to 50% less runtime complexity than the state-of-the-art Q-learning algorithms. Numerical results validate assumptions made in the theoretical analysis.

提出了一种新颖的模型无关的集合强化学习算法，通过在多个合成的与马尔可夫决策过程相关的环境上运行多个Q学习算法，并使用基于Jensen-Shannon差异的自适应加权机制来融合输出，获得具有低复杂度的近似最优策略。与最先进的Q学习算法相比，数值实验结果显示，该算法平均策略误差可以减少高达55％，运行时复杂度可以减少高达50％，并验证了理论分析中的假设。

多时间尺度集成 Q-learning 用于马尔科夫决策过程策略优化