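TL;DR: This paper proposes a theoretical approach called Behavior Expectation Bounds (BEB) for formally investigating several inherent characteristics and limitations of alignment in large language models. It reveals fundamental limitations of LLM alignment and highlights the need for reliable mechanisms to ensure AI safety.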
Abstract
An important aspect in developing language models that interact with humans
is aligning their behavior to be useful and unharmful for their human users.
This is usually achieved by tuning the model in a way that enhances desired
behaviors and inhibits undesired ones, a process referred