We consider a setting for Inverse Reinforcement Learning (IRL) in which the learner is given the ability to actively select multiple environments and to observe an agent's behavior in each of them. We first demonstrate that if the learner can experiment with arbitrary transition dynamics on some fixed set of states and actions, then there exists an algorithm that reconstructs the agent's reward function to the fullest extent theoretically possible, and that requires only a small (logarithmic) number of experiments. We contrast this result with what is known about IRL in a single fixed environment, namely that the true reward function is fundamentally unidentifiable. We then extend this setting to the more realistic case where the learner may not select arbitrary transition dynamics, but is instead restricted to some fixed set of environments that it may try. We connect the problem of maximizing the information derived from experiments to submodular function maximization and demonstrate that a greedy algorithm is near optimal (up to logarithmic factors). Finally, we empirically validate our algorithm on an environment inspired by behavioral psychology.
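To make the greedy selection step concrete, the following is a minimal sketch of greedy maximization of a monotone submodular set function. It is not the algorithm from this work: the objective is a toy coverage function standing in for the information gained from experiments, and the names `coverage`, `greedy_select`, and `REVEALS`, along with the environment labels and the data, are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Set


def coverage(chosen: List[str], reveals: Dict[str, Set[int]]) -> int:
    """Toy monotone submodular objective: number of distinct items revealed.

    Assumption: each environment reveals a fixed set of abstract "facts"
    about the reward; the real information objective may look different.
    """
    covered: Set[int] = set()
    for env in chosen:
        covered |= reveals[env]
    return len(covered)


def greedy_select(reveals: Dict[str, Set[int]], budget: int) -> List[str]:
    """Pick up to `budget` environments, each time taking the largest marginal gain."""
    chosen: List[str] = []
    for _ in range(budget):
        base = coverage(chosen, reveals)
        best_env, best_gain = None, 0
        for env in sorted(reveals):  # sorted for deterministic tie-breaking
            if env in chosen:
                continue
            gain = coverage(chosen + [env], reveals) - base
            if gain > best_gain:
                best_env, best_gain = env, gain
        if best_env is None:  # no remaining environment adds anything new
            break
        chosen.append(best_env)
    return chosen


# Hypothetical candidate environments and the facts each one would reveal.
REVEALS = {
    "env_a": {1, 2, 3},
    "env_b": {3, 4},
    "env_c": {4, 5, 6, 7},
    "env_d": {1, 7},
}

picked = greedy_select(REVEALS, budget=2)
print(picked, coverage(picked, REVEALS))
```

With this toy data, the first pick is the environment that reveals the most on its own, and the second pick is the one that adds the most beyond what is already covered, which is the diminishing-returns structure that makes greedy selection effective for submodular objectives.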

Towards Resolving Unidentifiability in Inverse Reinforcement Learning