The $Q$-learning algorithm is a simple and widely used stochastic approximation scheme for reinforcement learning, but the basic protocol can exhibit instability in conjunction with function approximation. Such instability can be observed even with linear function approximation. In practice, tools such as target networks and experience replay appear to be essential, but the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second-order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state-of-the-art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. We show that the performance of the algorithm degrades gracefully under a novel and more permissive notion of approximation error. The algorithm also exhibits a form of instance-dependence, in that its performance depends on the "effective" feature dimension.

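This paper examines the instability of the $Q$-learning algorithm and proposes an exploration-based variant. By combining a second-order update, target networks, and related mechanisms, the algorithm attains state-of-the-art regret bounds on linear MDPs with a space complexity independent of the number of steps elapsed. The algorithm also exhibits a form of instance-dependence, and its performance degrades slowly under a more permissive notion of approximation error.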
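
To make two of the ingredients named above concrete, the following is a minimal sketch, not the algorithm analyzed in this work. It combines a second-order (recursive least-squares) update with a frozen copy of the weights used as a target. The class name `LinearQSketch`, the parameters `reg` and `sync_every`, and the periodic synchronization rule are all assumptions made for illustration; the exploration bonus and the replay-like mechanism are omitted.

```python
import numpy as np


class LinearQSketch:
    """Illustrative linear Q-learner (hypothetical, not the paper's method)."""

    def __init__(self, dim, gamma=0.9, reg=1.0, sync_every=50):
        self.gamma = gamma
        self.sync_every = sync_every      # assumed fixed sync period
        self.w = np.zeros(dim)            # online weights
        self.w_target = np.zeros(dim)     # frozen target weights
        self.cov_inv = np.eye(dim) / reg  # inverse of regularized feature covariance
        self.steps = 0

    def update(self, phi, reward, next_phis):
        """One transition: phi is the (d,) feature of (s, a);
        next_phis is an (n_actions, d) array of features at the next state."""
        # Bootstrapped target computed with the frozen weights.
        target = reward + self.gamma * np.max(next_phis @ self.w_target)
        # Sherman-Morrison rank-one update of the inverse covariance.
        k = self.cov_inv @ phi
        gain = k / (1.0 + phi @ k)
        self.cov_inv -= np.outer(gain, k)
        # Second-order (recursive least-squares) step on the online weights.
        self.w += gain * (target - phi @ self.w)
        # Periodically copy the online weights into the target.
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.w_target = self.w.copy()
```

The learner stores only two weight vectors and one $d \times d$ matrix, so its memory footprint depends on the feature dimension $d$ and not on the number of transitions processed.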

Stabilizing Q-learning with linear architectures for provably efficient learning