We consider online reinforcement learning (RL) in episodic Markov decision
processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is
assumed that the action-values of all policies can be expressed as linear
functions of state-action features. This class is known to be more general than
linear MDPs, where the transition kernel and the reward function are assumed to
be linear functions of the feature vectors. As our first contribution, we show
that the difference between the two classes is the presence of states in
linearly $q^\pi$-realizable MDPs where for any policy, all the actions have
approximately equal values, and skipping over these states by following an
arbitrarily fixed policy in those states transforms the problem to a linear
MDP. Based on this observation, we derive a novel (computationally inefficient)
learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously
learns what states should be skipped over and runs another learning algorithm
on the linear MDP hidden in the problem. The method returns an
$\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions
with the MDP, where $H$ is the time horizon and $d$ is the dimension of the
feature vectors, giving the first polynomial-sample-complexity online RL
algorithm for this setting. The results are proved for the misspecified case,
where the sample complexity is shown to degrade gracefully with the
misspecification error.

在线强化学习中的线性可实现的马尔可夫决策过程 (MDP)，提出了一种计算效率较低的学习算法，通过跳过特定状态转化为线性 MDP，并证明了该算法在这种情况下具有多项式样本复杂度。