We study the exploration problem with approximate linear action-value
functions in episodic reinforcement learning under the notion of low inherent
Bellman error, a condition normally employed to show convergence of approximate
value iteration. First, we relate this condition to other common frameworks and
show that it is strictly more general than the low-rank (or linear) MDP
assumption of prior work. Second, we provide an algorithm with a
high-probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H
\sqrt{d_t} \IBE K)$ where $H$ is the horizon, $K$ is the number of episodes,
$\IBE$ is the value of the inherent Bellman error, and $d_t$ is the feature
dimension at timestep $t$. In addition, we show that this result cannot be
improved beyond constant and logarithmic factors by proving a matching lower
bound. This has two important consequences: 1) exploration is possible using only
\emph{batch assumptions} with an algorithm that achieves the optimal
statistical rate for the setting we consider, which is more general than prior
work on low-rank MDPs; and 2) the lack of closedness (measured by the inherent
Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online
setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when
$H=1$ but with a different choice of the exploration parameter that allows
handling misspecified contextual linear bandits. While computational
tractability questions remain open for the MDP setting, this enriches the class
of MDPs with a linear representation for the action-value function where
statistically efficient reinforcement learning is possible.
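
To make the $H=1$ special case concrete, the following is a minimal sketch of a standard LinUCB-style contextual linear bandit. It is not the algorithm of this work: the class name `LinUCB`, the regularizer `reg`, and the exploration parameter `beta` are illustrative assumptions. In particular, the misspecification-aware choice of the exploration parameter is not given in the text above, so `beta` is simply left as an input.

```python
import numpy as np


class LinUCB:
    """Standard LinUCB-style sketch (hypothetical names, not the paper's method)."""

    def __init__(self, dim, beta, reg=1.0):
        self.beta = beta                 # exploration parameter (assumed given)
        self.A = reg * np.eye(dim)       # regularized Gram matrix of played features
        self.b = np.zeros(dim)           # sum of reward-weighted features

    def select(self, features):
        """Pick the action with the largest optimistic value.

        features: array of shape (n_actions, dim), one feature vector per action.
        """
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                       # ridge-regression estimate
        means = features @ theta                     # predicted rewards
        # Elliptical confidence width sqrt(x^T A^{-1} x) for each action.
        widths = np.sqrt(np.einsum("ij,jk,ik->i", features, A_inv, features))
        return int(np.argmax(means + self.beta * widths))

    def update(self, x, reward):
        """Incorporate the observed reward for the played feature vector x."""
        self.A += np.outer(x, x)
        self.b += reward * x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_star = np.array([0.8, 0.2])    # hypothetical true parameter
    agent = LinUCB(dim=2, beta=1.0)
    for _ in range(200):
        feats = rng.normal(size=(5, 2))  # 5 random actions per round
        a = agent.select(feats)
        r = feats[a] @ theta_star + 0.1 * rng.normal()
        agent.update(feats[a], r)
    print("estimate:", np.linalg.solve(agent.A, agent.b))
```

A larger `beta` widens the optimistic bonus and therefore favors less-explored directions; how that parameter should scale with the misspecification level is the part this sketch leaves unspecified.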

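This work studies the exploration problem with approximate linear action-value functions under low inherent Bellman error. It gives an algorithm whose high-probability regret upper bound depends on the feature dimension and the Bellman error, compares it with prior work, and shows that the algorithm is statistically efficient in the linear MDP case.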