We present a new algorithm for the contextual bandit learning problem, where
the learner repeatedly takes one of $K$ actions in response to the observed
context, and observes the reward only for that chosen action. Our method
assumes access to an oracle for solving fully supervised cos