In reward-free reinforcement learning (RL), an agent explores the environment first without any reward information, in order to achieve certain learning goals afterwards for any given reward. In this paper we focus on reward-free RL under low-rank MDP models, in which both the representation and linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from being satisfactory. In this work, we first provide the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show it can both find an $\epsilon$-optimal policy and achieve an $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity significantly improving the previous results. Such a sample complexity matches our lower bound in the dependence on $\epsilon$, as well as on $K$ in the large $d$ regime, where $d$ and $K$ respectively denote the representation dimension and action space cardinality. Finally, we provide a planning algorithm (without further interaction with true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under the same setting.

本文研究了低秩MDP模型下无奖励强化学习及其算法。首先推导出该问题的最低样本复杂度下限，并提出了 RAFFLE 算法，通过无奖励探索可以在样本复杂度显著改善下找到epsilon-最优策略和epsilon-准确的系统识别。最后，提出了 RAFFLE 的规划算法，可学习到近乎准确的表示，是该环境下首个具有表示学习保证的算法。

低秩MDPs下无奖励强化学习的样本复杂度改进