This paper presents the first non-asymptotic result showing that a model-free
algorithm can achieve a logarithmic cumulative regret for episodic tabular
reinforcement learning if there exists a strictly positive sub-optimality gap
in the optimal $Q$-function. We prove that the optimist