Offline reinforcement learning (RL) methods constrain the learned
policy to adhere closely to the behavior policy, thereby stabilizing value
learning and mitigating the selection of out-of-distribution (OOD) actions
at test time. Conventional approaches apply identical c