We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be, how will it differ from the loss function it was trained under, and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and an overview of topics for future research.

Risks from Learned Optimization in Advanced Machine Learning Systems