Real-world reinforcement learning (RL) is often severely limited since
typical RL algorithms heavily rely on the reset mechanism to sample proper
initial states. In practice, the reset mechanism is expensive to implement due
to the need for human intervention or heavily engineered environments. To make
learning more practical, we propose a generic no-regret reduction to
systematically design reset-free RL algorithms. Our reduction turns reset-free
RL into a two-player game. We show that achieving sublinear regret in this
two-player game would imply learning a policy that has both sublinear
performance regret and sublinear total number of resets in the original RL
problem. This means that the agent eventually learns to perform optimally and
to avoid resets. Using this reduction, we design an instantiation for linear
Markov decision processes, which is, to our knowledge, the first provably
correct reset-free RL algorithm.
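
As a reading aid, the two "sublinear" guarantees can be written out. The notation below is an assumption for illustration and does not come from the abstract: Regret(K) is the cumulative performance regret and Resets(K) the total number of resets after K episodes. Sublinear growth means both per-episode averages vanish, which is the sense in which the agent eventually performs optimally and avoids resets.

```latex
\[
\lim_{K \to \infty} \frac{\mathrm{Regret}(K)}{K} = 0,
\qquad
\lim_{K \to \infty} \frac{\mathrm{Resets}(K)}{K} = 0 .
\]
```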

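This work proposes a reset-free reinforcement learning algorithm that turns reset-free RL into a two-player game in order to achieve sublinear performance regret and a sublinear total number of resets. In addition, the proposed instantiation for linear Markov decision processes is the first provably correct reset-free RL algorithm.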

Provable Reset-free Reinforcement Learning Algorithm

Provable Reset-free Reinforcement Learning by No-Regret Reduction

By exploiting the computing power and local data of distributed clients,
federated learning (FL) offers advantages such as reducing communication
overhead and preserving data privacy. In each communication round
of FL, the clients update local models based on their own data and upload their
local updates via wireless channels. However, latency caused by hundreds to
thousands of communication rounds remains a bottleneck in FL. To minimize the
training latency, this work provides a multi-armed bandit-based framework for
online client scheduling (CS) in FL without knowing wireless channel state
information and statistical characteristics of clients. First, we propose a
CS algorithm based on the upper confidence bound policy (CS-UCB) for ideal
scenarios where local datasets of clients are independent and identically
distributed (i.i.d.) and balanced. An upper bound of the expected performance
regret of the proposed CS-UCB algorithm is provided, which indicates that the
regret grows logarithmically over communication rounds. Then, to address
non-ideal scenarios with non-i.i.d. and unbalanced properties of local datasets
and varying availability of clients, we further propose a CS algorithm based on
the UCB policy and virtual queue technique (CS-UCB-Q). An upper bound is also
derived, which shows that the expected performance regret of the proposed
CS-UCB-Q algorithm can have a sub-linear growth over communication rounds under
certain conditions. In addition, the convergence performance of FL training is
analyzed. Finally, simulation results validate the efficiency of the proposed
algorithms.
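
The bandit view of client scheduling described above can be illustrated with a minimal sketch. This is a generic UCB1-style top-m selection, not the paper's CS-UCB or CS-UCB-Q algorithm. The class name `UCBClientScheduler`, the reward definition (one minus a latency normalized to [0, 1]), and the simulated latencies are all assumptions made for illustration.

```python
import math
import random


class UCBClientScheduler:
    """Hypothetical UCB1-style scheduler: each client is one bandit arm."""

    def __init__(self, num_clients, clients_per_round):
        self.num_clients = num_clients
        self.clients_per_round = clients_per_round
        self.counts = [0] * num_clients         # times each client was scheduled
        self.mean_reward = [0.0] * num_clients  # running mean reward per client
        self.round = 0

    def _index(self, client):
        # Clients that were never scheduled get priority so each is tried once.
        if self.counts[client] == 0:
            return float("inf")
        bonus = math.sqrt(2.0 * math.log(self.round) / self.counts[client])
        return self.mean_reward[client] + bonus

    def select(self):
        """Return the clients with the largest UCB indices for this round."""
        self.round += 1
        ranked = sorted(range(self.num_clients), key=self._index, reverse=True)
        return ranked[: self.clients_per_round]

    def update(self, client, reward):
        """Fold one observed reward in [0, 1] into the client's running mean."""
        self.counts[client] += 1
        n = self.counts[client]
        self.mean_reward[client] += (reward - self.mean_reward[client]) / n


def simulate(num_rounds=2000, seed=0):
    """Run the scheduler against made-up, normalized client latencies."""
    rng = random.Random(seed)
    mean_latency = [0.9, 0.7, 0.5, 0.3, 0.1]  # assumed values, not from the paper
    scheduler = UCBClientScheduler(len(mean_latency), clients_per_round=2)
    for _ in range(num_rounds):
        for client in scheduler.select():
            noisy = rng.gauss(mean_latency[client], 0.05)
            latency = min(1.0, max(0.0, noisy))
            # Assumed reward: lower latency means higher reward.
            scheduler.update(client, 1.0 - latency)
    return scheduler


if __name__ == "__main__":
    result = simulate()
    print("times scheduled per client:", result.counts)
```

In this toy setting the scheduler needs no prior knowledge of the latency distribution: it tries every client once, then concentrates on the low-latency clients while the exploration bonus keeps occasionally re-checking the others. Handling non-i.i.d. data, unbalanced datasets, or varying client availability, which the abstract addresses with a virtual queue technique, is outside the scope of this sketch.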

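This paper proposes an online client scheduling (CS) framework based on multi-armed bandits to reduce the latency caused by the hundreds to thousands of communication rounds in federated learning. Two CS algorithms based on the upper confidence bound (UCB) policy are proposed: CS-UCB for ideal scenarios, and CS-UCB-Q for non-ideal scenarios with non-i.i.d. and unbalanced local datasets and varying client availability. The paper also analyzes the convergence performance of FL training, and simulation results validate the efficiency of the proposed algorithms.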