In this paper, we study the problem of efficient online reinforcement learning in the infinite-horizon setting when an offline dataset is available at the start. We assume that the offline dataset is generated by an expert with an unknown level of competence, i.e., the expert is imperfect and does not necessarily follow the optimal policy. We show that if the learning agent models the behavioral policy used by the expert (parameterized by a competence parameter), it can achieve substantially lower cumulative regret than if it does not. We establish an upper bound on the regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite-horizon setting. We then propose an approximate Informed RLSVI algorithm, which can be interpreted as performing imitation learning with the offline dataset followed by online learning.

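This paper studies efficient online reinforcement learning in the infinite-horizon setting, starting from an offline dataset generated by an expert of unknown competence. We show that a learning agent that models the expert's behavioral policy performs better at minimizing cumulative regret. We establish a prior-dependent regret bound and propose an approximate Informed RLSVI algorithm, which can be interpreted as imitation learning on the offline dataset followed by online learning.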

Efficient Online Learning with Offline Datasets for Infinite-Horizon MDPs: A Bayesian Approach