Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but a