Designing a reward function that captures every aspect of the intended
behavior is almost impossible, and making it generalize beyond the training
environments is harder still. Active Inverse Reward Design (AIRD) proposed the use
of a series of queries, comparing possible reward functions in a single
training environment. The human's answers give the agent information about
suboptimal behaviors, from which it computes a probability distribution over
the intended reward function. However, AIRD ignores the possibility of unknown
features appearing in real-world environments, as well as the safety measures
needed until the agent has fully learned the reward function. I improved this method
and created Risk-averse Batch Active Inverse Reward Design (RBAIRD), which
groups the environments the agent encounters during real-world use into
batches and processes them sequentially. For a predetermined number of
iterations, it asks queries that the human answers for each environment in
the batch. Once a batch is finished, the updated probabilities are carried
over to the next batch. This makes it capable
of adapting to real-world scenarios and learning how to treat unknown features
it encounters for the first time. I also integrated a risk-averse planner,
similar to that of Inverse Reward Design (IRD), which samples a set of reward
functions from the probability distribution and computes a trajectory that
favors the rewards it is most certain about. This ensures safety while the agent is
still learning the reward function, and enables the use of this approach in
situations where caution is vital. RBAIRD outperformed the previous
approaches in efficiency, accuracy, and action certainty, adapted quickly to
new, unknown features, and can be used more widely for the alignment of
critical, powerful AI models.
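
The loop described above can be illustrated with a minimal sketch. It is not the RBAIRD implementation: it assumes a finite set of candidate reward weight vectors, a softmax model of the human's answers with a hypothetical rationality parameter `beta`, queries that are fixed in advance rather than actively selected, and a worst-case criterion over sampled rewards as one possible form of risk aversion. All function names are illustrative.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def update_belief(belief, candidates, options, choice, beta=5.0):
    """One Bayesian update of the belief from a single answered query.

    Assumption: the human picks option i with probability proportional to
    exp(beta * return of option i under the true reward weights).
    """
    posterior = []
    for prior, w in zip(belief, candidates):
        scores = [beta * dot(w, phi) for phi in options]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        posterior.append(prior * exps[choice] / sum(exps))
    total = sum(posterior)
    return [p / total for p in posterior]

def run_batches(batches, candidates, answer_fn, n_queries=2):
    """Process batches in order; the belief carries over between batches.

    Each batch is a list of environments, and each environment is a list of
    queries (one per iteration). A query is a list of feature vectors.
    """
    belief = [1.0 / len(candidates)] * len(candidates)
    for batch in batches:
        for i in range(n_queries):
            for env_queries in batch:
                options = env_queries[i]
                choice = answer_fn(options)
                belief = update_belief(belief, candidates, options, choice)
    return belief

def risk_averse_choice(belief, candidates, trajectories, n_samples=50, rng=None):
    """Pick the trajectory with the best worst-case return over sampled rewards."""
    rng = rng or random.Random()
    sampled = rng.choices(candidates, weights=belief, k=n_samples)
    worst = [min(dot(w, phi) for w in sampled) for phi in trajectories]
    return worst.index(max(worst))

# Toy usage: three candidate rewards over two features; the first is the true one.
candidates = [(1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
true_w = candidates[0]

def simulated_human(options):
    # Stand-in for the human: always picks the option that is best under true_w.
    return max(range(len(options)), key=lambda i: dot(true_w, options[i]))

query = [(1.0, 0.0), (0.0, 1.0)]
batches = [[[query, query]], [[query, query], [query, query]]]
belief = run_batches(batches, candidates, simulated_human)

# While the belief is still split between the first two candidates, the planner
# avoids the trajectory whose payoff depends on the uncertain second feature.
safe = risk_averse_choice([0.5, 0.5, 0.0], candidates,
                          [(1.0, 0.0), (0.0, 3.0)], rng=random.Random(0))
```

In this toy setup the belief concentrates on the true candidate as answers accumulate across batches, and the planner prefers the trajectory whose return is the same under every reward it still considers plausible.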

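By processing environments in batches, the method gradually refines the probability distribution over the possible reward functions being queried, improving efficiency and accuracy while maintaining safety, adapting to unknown features, and supporting the alignment of important AI models.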