In many real-life reinforcement learning (RL) problems, deploying new policies is costly. In those scenarios, algorithms must solve exploration (which requires adaptivity) while switching the deployed policy sparsely (which limits adaptivity). In this paper, we go beyond the existing state of the art on this problem, which focused on linear Markov Decision Processes (MDPs), by considering linear Bellman-complete MDPs with low inherent Bellman error. We propose the ELEANOR-LowSwitching algorithm, which achieves near-optimal regret with a switching cost logarithmic in the number of episodes and linear in the time horizon $H$ and feature dimension $d$. We also prove a lower bound on the switching cost proportional to $dH$ among all algorithms with sublinear regret. In addition, we show that the ``doubling trick'' used in ELEANOR-LowSwitching can be further leveraged for generalized linear function approximation, under which we design a sample-efficient algorithm with near-optimal switching cost.
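To illustrate the kind of ``doubling trick'' mentioned above, here is a minimal sketch, assuming the common determinant-doubling rule from the low-switching linear bandit and linear MDP literature: the deployed policy is refreshed only when the determinant of the regularized feature covariance matrix has at least doubled since the last switch. The function names, the regularizer `lam`, and the synthetic unit-norm features are assumptions made for illustration; this is not the ELEANOR-LowSwitching procedure itself.

```python
import math
import random


def random_unit_vector(d, rng):
    """Hypothetical helper: draw a random feature vector with unit norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def count_policy_switches(features, lam=1.0):
    """Count switches under an assumed determinant-doubling rule.

    features: list of d-dimensional vectors phi_k with norm at most 1.
    The covariance matrix is Lambda_k = lam * I + sum_{i <= k} phi_i phi_i^T.
    A switch is triggered whenever det(Lambda_k) is at least twice its
    value at the previous switch.
    """
    d = len(features[0])
    # Inverse of Lambda_0 = lam * I, maintained incrementally.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    log_det = d * math.log(lam)
    log_det_at_last_switch = log_det
    switches = 0

    for phi in features:
        # u = Lambda^{-1} phi and quad = phi^T Lambda^{-1} phi.
        u = [sum(inv[i][j] * phi[j] for j in range(d)) for i in range(d)]
        quad = sum(phi[i] * u[i] for i in range(d))

        # Matrix determinant lemma:
        # det(Lambda + phi phi^T) = det(Lambda) * (1 + quad).
        log_det += math.log1p(quad)

        # Sherman-Morrison update of the inverse (Lambda^{-1} is symmetric).
        for i in range(d):
            for j in range(d):
                inv[i][j] -= u[i] * u[j] / (1.0 + quad)

        # Doubling test: switch only if the determinant has doubled.
        if log_det - log_det_at_last_switch >= math.log(2.0):
            switches += 1
            log_det_at_last_switch = log_det

    return switches


if __name__ == "__main__":
    rng = random.Random(0)
    d, K = 4, 2000
    feats = [random_unit_vector(d, rng) for _ in range(K)]
    print("switches:", count_policy_switches(feats))
    print("upper bound:", d * math.log2(1 + K / d))
```

Under the assumption that every feature has norm at most 1, $\det(\Lambda_K) \le (\lambda + K/d)^d$, so this rule triggers at most $d \log_2(1 + K/(d\lambda))$ switches over $K$ rounds, which is logarithmic in $K$ and linear in $d$. Applying such a rule separately at each of the $H$ steps of an episode is one way a switching cost of order $dH \log K$ can arise.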

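This work proposes a new algorithm, ELEANOR-LowSwitching, which achieves near-optimal regret in linear Bellman-complete Markov decision processes with low inherent Bellman error, with a switching cost that is only logarithmic in the number of episodes and linear in the time horizon and feature dimension. We also prove a lower bound on the switching cost, proportional to $dH$, among all algorithms with sublinear regret. For generalized linear function approximation, the ``doubling trick'' used in the algorithm can be further leveraged, and we design a sample-efficient algorithm with near-optimal switching cost.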

Logarithmic Switching Cost in Reinforcement Learning beyond Linear Markov Decision Processes