Learning a reward function from human preferences is challenging as it
typically requires having a high-fidelity simulator or using expensive and
potentially unsafe actual physical rollouts in the environment. However, in
many tasks the agent might have access to offline data from rela