Before deploying autonomous agents in the real world, we need to be confident they will perform safely in novel situations. Ideally, we would expose agents to a very wide range of situations during training, allowing them to learn about every possible danger, but this is often impractical. This paper investigates safety and generalization from a limited number of training environments in deep reinforcement learning (RL). We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments. Firstly, in a gridworld setting, we show that catastrophes can be significantly reduced with simple modifications, including ensemble model averaging and the use of a blocking classifier. In the more challenging CoinRun environment we find similar methods do not significantly reduce catastrophes. However, we do find that the uncertainty information from the ensemble is useful for predicting whether a catastrophe will occur within a few steps and hence whether human intervention should be requested.

本文研究深度强化学习中有限的训练环境对安全和泛化性能的影响，通过模型平均和使用阻塞分类器等简单方法，可显著降低在网格世界中的灾难情况，但在 CoinRun 环境中会存在一定失败率，然而可以通过系集的不确定性信息来预测是否需要人类干预。

安全关键的强化学习中基于少量环境的泛化