Generative models of language exhibit impressive capabilities but still place non-negligible probability mass over undesirable outputs. In this work, we address the task of updating a model to avoid unwanted outputs while minimally changing model behavior otherwise, a challenge we refer to as a minimal targeted update. We first formalize the notion of a minimal targeted update and propose a method to achieve such updates using negative examples from a model's generations. Our proposed Targeted Negative Training (TNT) results in updates that keep the new distribution close to the original, unlike existing losses for negative signal which push down probability but do not control what the updated distribution will be. In experiments, we demonstrate that TNT yields a better trade-off between reducing unwanted behavior and maintaining model generation behavior than baselines, paving the way towards a modeling paradigm based on iterative training updates that constrain models from generating undesirable outputs while preserving their impressive capabilities.
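
As a toy illustration of what "minimal" can mean here, the sketch below removes an unwanted outcome from a small categorical distribution by zeroing its probability and renormalizing the rest. Among all distributions that put zero mass on the banned outcomes, this renormalized one has the smallest KL divergence to the original. The function names, the three-outcome distribution, and the choice of KL divergence are assumptions made for illustration; the abstract does not specify the TNT loss, and this is not a reproduction of it.

```python
import math

def minimal_targeted_update(p, banned):
    """Zero out banned outcomes and renormalize the remaining mass.

    Hypothetical helper for illustration only; not the TNT objective.
    """
    kept = sum(prob for tok, prob in p.items() if tok not in banned)
    if kept <= 0.0:
        raise ValueError("all probability mass is on banned outcomes")
    return {tok: (0.0 if tok in banned else prob / kept) for tok, prob in p.items()}

def kl(q, p):
    """KL(q || p) in nats; terms where q is zero contribute nothing."""
    return sum(qv * math.log(qv / p[tok]) for tok, qv in q.items() if qv > 0.0)

# Toy "model" over three outcomes, one of which is unwanted.
original = {"fine": 0.5, "okay": 0.3, "toxic": 0.2}

# Targeted update: remove "toxic", keep the ratio between the other outcomes.
updated = minimal_targeted_update(original, {"toxic"})

# An update that also removes "toxic" but reshuffles the remaining mass,
# standing in for a loss that does not control where the mass goes.
alternative = {"fine": 0.2, "okay": 0.8, "toxic": 0.0}

print(updated)
print(kl(updated, original), kl(alternative, original))
```

Both updated distributions assign zero probability to the unwanted outcome, but the renormalized one stays closer to the original, which is the trade-off the abstract describes.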

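This work proposes Targeted Negative Training (TNT), a method that uses negative examples from a model's own generations to achieve minimal targeted updates: the model is updated to avoid undesirable outputs while its behavior otherwise changes as little as possible. TNT achieves a better trade-off than baselines between reducing unwanted behavior and maintaining the model's generation behavior, paving the way toward a modeling paradigm based on iterative training updates that constrain models from generating undesirable outputs.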

Targeted Negative Training for Minimal Targeted Updates of Language Models