The standard feedback model of reinforcement learning requires revealing the
reward of every visited state-action pair. However, in practice, it is often
the case that such frequent feedback is not available. In this work, we take a
first step towards relaxing this assumption and requi