This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon non-stationary Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of \begin{align*} \frac{SAH^3}{\varepsilon^2} \text{ sample episodes (up to log factor)} \end{align*} without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as ``reward-free exploration.'' The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.

本文研究了强化学习中的无奖励探索并设计了一种算法来提高其效率，该算法不需要提前了解奖励函数。算法最多需要采集SAH^3/ε^2个样本轨迹就能对于所有感兴趣的奖励函数找到ε-optimal策略，而且算法还能在样本量超过S^2AH^3/ε^2个轨迹时无限找到ε-optimal 策略，即便这些奖励函数是对抗性设计的。

强化学习中最小化-最优化奖励无关探索