In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.

本文研究了带有压缩算子的强化学习过程对经典时间差分学习算法的影响，并证明了在误差反馈机制的作用下，以及与线性函数逼近和马尔可夫采样一起使用时，压缩的时间差分算法可以与SGD相似地具有非渐近理论保证。此外，本文还扩展了结果，提出了多智能体TD学习的线性收敛速度快速提升的证明。

利用压缩更新的时序差分学习：误差反馈与强化学习相遇