offline policy learning aims to discover decision-making policies from
previously-collected datasets without additional online interactions with the
environment. As the training dataset is fixed, its quality becomes a crucial
determining factor in the performance of the learned policy.