We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved in Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
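As a minimal sketch of the linear mixture assumption, the snippet below builds a small tabular transition kernel as a weighted combination of $d$ basis kernels, $P(s' \mid s,a) = \sum_{i=1}^{d} \theta_i P_i(s' \mid s,a)$. It then checks that the expected next-state value is linear in $\theta$, which is the quantity an integration oracle over the basis kernels would supply. All names here (`random_kernel`, `mix_kernels`, `expected_value`), the tabular sizes, and the choice of $\theta$ on the probability simplex are illustrative assumptions; none of this is the paper's algorithm.

```python
import random


def random_kernel(n_states, n_actions, rng):
    """Hypothetical helper: kernel[s][a] is a probability vector over next states."""
    kernel = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-9 for _ in range(n_states)]
            total = sum(weights)
            per_action.append([w / total for w in weights])
        kernel.append(per_action)
    return kernel


def mix_kernels(basis, theta):
    """Mixture kernel: P(s'|s,a) = sum_i theta[i] * basis[i][s][a][s']."""
    n_states = len(basis[0])
    n_actions = len(basis[0][0])
    return [
        [
            [
                sum(t * b[s][a][s2] for t, b in zip(theta, basis))
                for s2 in range(n_states)
            ]
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def expected_value(kernel, value, s, a):
    """Integration-oracle stand-in: sum over s' of kernel(s'|s,a) * value(s')."""
    return sum(p * v for p, v in zip(kernel[s][a], value))


rng = random.Random(0)
d, n_states, n_actions = 3, 4, 2
basis = [random_kernel(n_states, n_actions, rng) for _ in range(d)]
theta = [0.5, 0.3, 0.2]  # assumed to lie on the simplex for this illustration
mixture = mix_kernels(basis, theta)

value = [0.0, 1.0, 2.0, 3.0]  # an arbitrary value function over the 4 states
s, a = 0, 1

# Feature vector for (s, a) under `value`: one expectation per basis kernel.
features = [expected_value(b, value, s, a) for b in basis]
lhs = expected_value(mixture, value, s, a)
rhs = sum(t * f for t, f in zip(theta, features))
```

Here `lhs` and `rhs` agree up to floating-point error. This illustrates why the expected next-state value can be treated as a linear regression target in the unknown parameter $\theta$.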

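This paper studies reinforcement learning with linear function approximation, where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model and the learning agent has access to either an integration or a sampling oracle for the individual basis kernels. Based on a new Bernstein-type self-normalized concentration inequality, we propose $\text{UCRL-VTR}^{+}$, a new computationally efficient algorithm with linear function approximation for linear mixture MDPs in the undiscounted setting. We also propose a new algorithm, $\text{UCLK}^{+}$, for the same family of MDPs in the discounted setting. Both algorithms are nearly minimax optimal in their respective settings, and they are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.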

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes