We present a reduction from reinforcement learning (RL) to no-regret online
learning based on the saddle-point formulation of RL, by which "any" online
algorithm with sublinear regret can generate policies with provable performance
guarantees. This new perspective decouples the RL prob