offline evaluation of reinforcement learning algorithms based on collected data (state transitions and rewards) has remained a challenging problem. Common practice is to create a simulator based on collected data and then run the algorithm against this simulator. Such an approach invol