Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process (MDP) with $S$ states, $A$ actions, and horizon length $H$, substantial progress has been achieved.
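
For concreteness, this setting admits the standard tabular formalization sketched below; the symbols $\mathcal{M}$, $P_h$, and $r_h$ are illustrative notation, not fixed by the text above.

% A minimal sketch of the finite-horizon episodic MDP, assuming the
% usual tabular setup; notation here is illustrative.
\[
  \mathcal{M} = \bigl(\mathcal{S}, \mathcal{A}, H, \{P_h\}_{h=1}^{H}, \{r_h\}_{h=1}^{H}\bigr),
  \qquad |\mathcal{S}| = S, \quad |\mathcal{A}| = A,
\]
where, at each step $h = 1, \dots, H$ of an episode, the agent observes a state $s_h \in \mathcal{S}$, takes an action $a_h \in \mathcal{A}$, collects a reward $r_h(s_h, a_h)$, and transitions to $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$.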