reward hacking occurs when an agent performs very well with respect to a
"proxy" reward function (which may be hand-specified or learned), but poorly
with respect to the unknown true reward. Since ensuring good alignment between
the proxy and true reward is extremely difficult, one app