The Whittle index policy is a heuristic for the intractable restless multi-armed
bandit (RMAB) problem. Although it is provably asymptotically optimal, finding
Whittle indices remains difficult. In this paper, we present Neural-Q-Whittle,
a Whittle index based Q-learning algorithm for RMAB with neural network
function approximation, which is an example of nonlinear two-timescale
stochastic approximation with Q-function values updated on a faster timescale
and Whittle indices on a slower timescale. Despite the empirical success of
deep Q-learning, the non-asymptotic convergence rate of Neural-Q-Whittle, which
couples neural networks with two-timescale Q-learning, remains largely unclear.
This paper provides a finite-time analysis of Neural-Q-Whittle, where data are
generated from a Markov chain and the Q-function is approximated by a ReLU neural
network. Our analysis leverages a Lyapunov drift approach to capture the
evolution of two coupled parameters, and the nonlinearity in value function
approximation further requires us to characterize the approximation error.
Combining these provides Neural-Q-Whittle with an $\mathcal{O}(1/k^{2/3})$
convergence rate, where $k$ is the number of iterations.

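Neural-Q-Whittle, a Whittle index based Q-learning algorithm with neural network function approximation, addresses the restless multi-armed bandit problem. By coupling Q-function and Whittle index updates on two timescales, the paper gives a finite-time analysis showing an $\mathcal{O}(1/k^{2/3})$ convergence rate for Neural-Q-Whittle.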

Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation
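
The two-timescale structure described in the abstract above can be illustrated with a minimal tabular sketch for a single arm. This is not the paper's neural algorithm: the table `Q`, the subsidy estimate `lam`, the reference state `ref_state`, the discount `gamma`, and the step sizes `alpha` (fast) and `beta` (slow) are all names and choices assumed here for illustration. The idea is that the Q-values are updated with a larger step size under a subsidy paid for the passive action, while the subsidy drifts slowly toward the value at which the two actions are equally attractive in the reference state.

```python
def two_timescale_step(Q, lam, s, a, r, s_next, ref_state,
                       alpha, beta, gamma=0.9):
    """One illustrative update for a single arm.

    Q   : list of [q_passive, q_active] rows, one row per state
    lam : current subsidy estimate for ref_state
    a   : 0 = passive, 1 = active
    """
    Q = [row[:] for row in Q]  # copy so the caller's table is not mutated

    # Fast timescale: Q-learning step; the passive action earns the subsidy lam.
    target = r + lam * (1 - a) + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

    # Slow timescale: move lam toward the point where the active and passive
    # actions have equal value in ref_state (uses the freshly updated Q).
    lam = lam + beta * (Q[ref_state][1] - Q[ref_state][0])
    return Q, lam
```

In a two-timescale scheme of this kind, `beta` is typically chosen much smaller than `alpha`, so that the Q-values appear nearly converged from the point of view of the slowly moving subsidy.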

We consider a hybrid reinforcement learning setting (Hybrid RL), in which an
agent has access to an offline dataset and the ability to collect experience
via real-world online interaction. The framework mitigates the challenges that
arise in both pure offline and online RL settings, allowing for the design of
simple and highly effective algorithms, in both theory and practice. We
demonstrate these advantages by adapting the classical Q learning/iteration
algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In
our theoretical results, we prove that the algorithm is both computationally
and statistically efficient whenever the offline dataset supports a
high-quality policy and the environment has bounded bilinear rank. Notably, we
require no assumptions on the coverage provided by the initial distribution, in
contrast with guarantees for policy gradient/iteration methods. In our
experimental results, we show that Hy-Q with neural network function
approximation outperforms state-of-the-art online, offline, and hybrid RL
baselines on challenging benchmarks, including Montezuma's Revenge.

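This paper introduces Hy-Q, a hybrid reinforcement learning algorithm that uses an offline dataset together with real-time online interaction to make algorithm design more efficient, and that outperforms comparable algorithms on benchmarks such as Montezuma's Revenge.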

Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient
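
The hybrid setting described in the abstract above combines a fixed offline dataset with freshly collected online experience. A minimal sketch of one way to form a training batch from both sources follows. The function name `sample_hybrid_batch`, the `offline_fraction` parameter, and the 50/50 default split are assumptions made here for illustration and are not taken from the paper.

```python
import random

def sample_hybrid_batch(offline_data, online_data, batch_size,
                        offline_fraction=0.5, rng=None):
    """Draw a batch that mixes offline and online transitions (with replacement)."""
    rng = rng or random.Random()

    # Until online experience exists, fall back to offline transitions only.
    if not online_data:
        n_off = batch_size
    else:
        n_off = round(batch_size * offline_fraction)
    n_on = batch_size - n_off

    batch = [rng.choice(offline_data) for _ in range(n_off)]
    batch += [rng.choice(online_data) for _ in range(n_on)]
    rng.shuffle(batch)  # avoid a fixed offline-then-online ordering
    return batch
```

A Q-learning or fitted Q-iteration step would then be run on the returned batch exactly as in the purely online or purely offline case. Only the data source changes.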

Many reinforcement learning algorithms rely on value estimation; however, the
most widely used algorithms -- namely temporal difference algorithms -- can
diverge under both off-policy sampling and nonlinear function approximation.
Many algorithms have been developed for off-policy value estimation based on
the linear mean squared projected Bellman error (MSPBE) and are sound under
linear function approximation. Extending these methods to the nonlinear case
has been largely unsuccessful. Recently, several methods have been introduced
that approximate a different objective -- the mean-squared Bellman error (MSBE)
-- which naturally facilitates nonlinear approximation. In this work, we build
on these insights and introduce a new generalized MSPBE that extends the linear
MSPBE to the nonlinear setting. We show how this generalized objective unifies
previous work and obtain new bounds for the value error of the solutions of the
generalized objective. We derive an easy-to-use, but sound, algorithm to
minimize the generalized objective, and show that it is more stable across
runs, is less sensitive to hyperparameters, and performs favorably across four
control domains with neural network function approximation.

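This paper introduces a reinforcement learning algorithm for nonlinear function approximation that uses a new generalized mean squared projected Bellman error as its objective, improving stability and performance.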
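
For reference, the two objectives named in the abstract above have the following textbook forms under linear function approximation. The symbols are notation assumed here rather than taken from the abstract, and the paper's generalized MSPBE is not reproduced.

```latex
% v_w   : parameterized value estimate
% T^\pi : Bellman operator for policy \pi
% d     : state weighting
% \Pi   : d-weighted projection onto the span of the linear features
\mathrm{MSBE}(w)  = \lVert T^{\pi} v_w - v_w \rVert_d^2, \qquad
\mathrm{MSPBE}(w) = \lVert \Pi \, (T^{\pi} v_w - v_w) \rVert_d^2, \qquad
\lVert v \rVert_d^2 = \sum_{s} d(s)\, v(s)^2 .
```

The projection $\Pi$ is what ties the MSPBE to a linear feature space, which is why carrying it over to nonlinear approximators needs a generalized definition.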