Reinforcement learning (RL) has achieved impressive performance in a variety
of online settings in which an agent's ability to query the environment for
transitions and rewards is effectively unlimited. However, in many practical
applications, the situation is reversed: an agent may ha