Most existing policy learning solutions require the learning agent to
receive high-quality supervision signals, such as well-designed rewards in
reinforcement learning (RL) or high-quality expert demonstrations in behavioral
cloning (BC). Such high-quality supervision is usually infeasib