Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, such as inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in all but the most isolated, controlled scenarios, undermining the goal of empowering real end-users. Recent attempts to learn from suboptimal demonstrations leverage pairwise rankings through Preference-based Reinforcement Learning (PbRL) to infer a policy that outperforms the demonstration. However, we show that these approaches make incorrect assumptions and, consequently, suffer from brittle, degraded performance. In this paper, we overcome the limitations of prior work by developing a novel computational technique that infers an idealized reward function from suboptimal demonstrations and bootstraps those demonstrations to synthesize optimality-parameterized data for training our reward function. We empirically validate that we can learn an idealized reward function with $\sim 0.95$ correlation with the ground-truth reward, versus only $\sim 0.75$ for prior work. We can then train policies achieving $\sim 200\%$ improvement over the suboptimal demonstration and $\sim 90\%$ improvement over prior work. Finally, we present a real-world implementation for teaching a robot to hit a topspin shot in table tennis better than the user's demonstration.
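The idea of synthesizing optimality-parameterized data and regressing a reward on it can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the 1-D task, the hypothetical demonstrator, degrading the demonstrator by mixing in random actions at a controllable rate, using one minus that rate as the regression target, and the linear least-squares reward model. It is not the paper's implementation. The ground-truth return is used only to evaluate the fit, never to train it.

```python
import numpy as np

def rollout(noise, rng, horizon=50):
    """Toy 1-D task: drive x toward 0. With probability `noise` a random
    action replaces the (hypothetical, suboptimal) demonstrator's action."""
    x = 5.0
    states = np.empty(horizon)
    for t in range(horizon):
        if rng.random() < noise:
            action = rng.uniform(-1.0, 1.0)   # injected random action
        else:
            action = -0.2 * np.sign(x)        # slow demonstrator step toward 0
        x += action
        states[t] = x
    return states

def features(states):
    # Trajectory-level features (assumed): sum |x|, sum x^2, and a bias term.
    return np.array([np.abs(states).sum(), np.square(states).sum(), 1.0])

def true_return(states):
    # Ground-truth return of the toy task; used for evaluation only.
    return -np.abs(states).sum()

def synthesize_and_fit(seed=0, rollouts_per_level=20):
    rng = np.random.default_rng(seed)
    noise_levels = np.linspace(0.0, 1.0, 11)
    X, y, g = [], [], []
    for noise in noise_levels:
        for _ in range(rollouts_per_level):
            states = rollout(noise, rng)
            X.append(features(states))
            y.append(1.0 - noise)             # assumed optimality label
            g.append(true_return(states))
    X, y, g = np.array(X), np.array(y), np.array(g)
    # Linear reward model fitted by least squares on the synthetic labels.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Pearson correlation between predicted and ground-truth returns.
    corr = np.corrcoef(X @ w, g)[0, 1]
    return w, corr

w, corr = synthesize_and_fit()
print("weights:", w)
print("correlation with ground-truth return:", round(float(corr), 3))
```

The correlation printed at the end plays the same role as the correlation figures quoted in the abstract: it measures how closely the learned reward tracks the true one, here on a toy problem only.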

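This paper proposes a new method that uses suboptimal demonstrations to synthesize optimality-parameterized data for training an idealized reward function, thereby overcoming some limitations of earlier methods when learning from suboptimal demonstrations and achieving better performance.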

Learning from Suboptimal Demonstration via Self-Supervised Reward Regression