We consider a class of reinforcement-learning systems in which the agent
follows a behavior policy to explore a discrete state-action space and find an
optimal policy while adhering to some restriction on its behavior. Such a
restriction may prevent the agent from visiting some state-acti