We study the problem of teaching via demonstrations in sequential decision-making tasks. In particular, we focus on the setting in which the teacher has no access to the learner's model or policy, and the learner's feedback is limited to trajectories that start from states selected by the teacher. Because the teacher must both select the starting states and infer the learner's policy, it can draw on methods from inverse reinforcement learning and active learning. In this work, we formalize the teaching process with limited feedback and propose an algorithm that solves this teaching problem. The algorithm uses a modified version of the active value-at-risk method to select the starting states, a modified maximum causal entropy algorithm to infer the policy, and the difficulty score ratio method to choose the teaching demonstrations. We test the algorithm in a synthetic car driving environment and conclude that it is an effective solution when the learner's feedback is limited.

Interactively Teaching an Inverse Reinforcement Learner with Limited Feedback