This work considers the problem of provably optimal reinforcement learning for (episodic) finite horizon MDPs, i.e. how an agent learns to maximize his/her (long term) reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetil