A recent line of work has shown that regret bounds in reinforcement learning (RL) can be (nearly) independent of the planning horizon; such bounds are known as horizon-free bounds. However, these bounds apply only to settings where a polynomial dependence on the size of the transition model is allowed, such as tabular Markov decision processes (MDPs) and linear mixture MDPs. We give the first horizon-free bound for the popular linear MDP setting, where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works, which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by (1) maintaining multiple weighted least-squares estimators for the value functions, and (2) proving a structural lemma showing that the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.
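
To make the phrase "weighted least-squares estimator" concrete, the following is a minimal sketch of a variance-weighted ridge regression together with an elliptical confidence width. It is an illustration only, not the algorithm of the paper: the function names, the regularizer `reg`, the per-sample variance proxies `variances`, and the radius `beta` are all assumptions introduced here.

```python
import numpy as np


def weighted_ridge(features, targets, variances, reg=1.0):
    """Variance-weighted ridge regression (illustrative sketch).

    Solves argmin_theta sum_i (phi_i . theta - y_i)^2 / sigma_i^2 + reg * ||theta||^2.
    `variances` holds hypothetical per-sample variance proxies sigma_i^2.
    """
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    dim = features.shape[1]
    # Weighted Gram matrix: reg * I + sum_i phi_i phi_i^T / sigma_i^2
    gram = reg * np.eye(dim) + (features * weights[:, None]).T @ features
    # Weighted moment vector: sum_i phi_i y_i / sigma_i^2
    moment = features.T @ (weights * targets)
    theta = np.linalg.solve(gram, moment)
    return theta, gram


def confidence_width(gram, phi, beta):
    """Elliptical width beta * ||phi||_{gram^{-1}} (beta is an assumed radius)."""
    phi = np.asarray(phi, dtype=float)
    return beta * np.sqrt(phi @ np.linalg.solve(gram, phi))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.normal(size=(200, 4))           # synthetic features
    theta_true = np.array([0.5, -1.0, 2.0, 0.0])
    sigma2 = rng.uniform(0.1, 1.0, size=200)  # synthetic variance proxies
    y = phi @ theta_true + rng.normal(size=200) * np.sqrt(sigma2)
    theta_hat, gram = weighted_ridge(phi, y, sigma2)
    print("estimate:", theta_hat)
    print("width at first feature:", confidence_width(gram, phi[0], beta=1.0))
```

Down-weighting samples with large variance proxies is the standard motivation for weighting the regression. How the weights are chosen, and how several such estimators are combined, is specific to the paper and is not reproduced here.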


Horizon-Free Regret for Linear Markov Decision Processes