BriefGPT.xyz
May, 2025
Reasoning Models Don't Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison...
TL;DR
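This paper examines the potential value of chain-of-thought (CoT) reasoning for AI safety, but finds that the CoTs of reasoning models are not sufficiently faithful. Outcome-based reinforcement learning improves CoT faithfulness at first, but the improvement does not continue. This suggests that CoT monitoring can help identify undesired behaviors during training and evaluation, but is not sufficient to rule them out entirely.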
Abstract
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and …