In this paper, we consider model-free federated reinforcement learning for
tabular episodic Markov decision processes. Under the coordination of a central
server, multiple agents collaboratively explore the environment and learn an
optimal policy without sharing their raw data. Although recent federated
Q-learning algorithms achieve near-linear regret speedup with low communication
cost, their regret remains suboptimal compared to the information bound. We
propose a novel model-free federated Q-learning
algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage
decomposition for variance reduction and operates under two distinct
mechanisms: synchronization between the agents and the server, and policy
update, both triggered by events. We prove that our algorithm not only requires
a lower logarithmic communication cost but also achieves almost optimal regret:
it reaches the information bound up to a logarithmic factor and attains
near-linear regret speedup over its single-agent counterpart when the time
horizon is sufficiently large.

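This paper presents a model-free federated reinforcement learning algorithm, termed FedQ-Advantage, that leverages reference-advantage decomposition for variance reduction and operates under two distinct mechanisms: synchronization between the agents and the server, and policy update, both triggered by events. We prove that the algorithm not only requires a lower logarithmic communication cost but also, when the time horizon is sufficiently large, achieves almost optimal regret that reaches the information bound, together with near-linear regret speedup over its single-agent counterpart.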