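TL;DR: Batched queries gradually refine a probability distribution over possible reward functions, improving efficiency and accuracy while preserving safety, and the approach adapts to unknown features and to adjusting important AI models.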
Abstract
Designing a perfect reward function that captures every aspect of the
intended behavior is almost impossible, especially one that generalizes beyond
the training environments. Active inverse reward design (AIRD)