BriefGPT.xyz
Oct, 2024
Strong Preferences Affect the Robustness of Value Alignment
Ziwei Xu, Mohan Kankanhalli
TL;DR
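This paper studies the robustness of value alignment and shows how sensitive preference models are to changes in preferences. We find that, in the Bradley-Terry and Plackett-Luce models, the probability of some preferences can change significantly as other preferences change, especially when preferences are dominant. This finding has important implications for the robustness and safety of value alignment in AI systems.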
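The sensitivity described above can be illustrated with a small numerical sketch. This is not the paper's own derivation or code; it assumes a three-option Bradley-Terry model in which P(i preferred over j) is the sigmoid of a score difference, so that P(1 over 2) is determined by P(1 over 3) and P(2 over 3). The function names `p12_from` and `sensitivity` are hypothetical and chosen only for this illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def p12_from(p13, p23):
    """P(1 over 2) implied by P(1 over 3) and P(2 over 3).

    Assumes Bradley-Terry: P(i over j) = sigmoid(s_i - s_j), hence
    s_1 - s_2 = logit(p13) - logit(p23).
    """
    return sigmoid(logit(p13) - logit(p23))


def sensitivity(p13, p23, eps=1e-6):
    """Central finite-difference estimate of d p12 / d p13."""
    upper = p12_from(p13 + eps, p23)
    lower = p12_from(p13 - eps, p23)
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    # Moderate preferences versus dominant (near-1) preferences.
    for p in (0.5, 0.9, 0.99):
        print(p, p12_from(p, p), sensitivity(p, p))
```

Under these assumptions, the estimated derivative grows as `p13` approaches 0 or 1, which matches the qualitative claim that dominant preferences make other preference probabilities more sensitive to small changes.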
Abstract
Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of