This paper critically evaluates the attempts to align Artificial Intelligence
(AI) systems, especially Large Language Models (LLMs), with human values and
intentions through Reinforcement Learning from Feedback (RLxF) methods,
involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we
show the shortcomings of the broadly pursued alignment goals of honesty,
harmlessness, and helpfulness. Through a multidisciplinary sociotechnical
critique, we examine both the theoretical underpinnings and practical
implementations of RLxF techniques, revealing significant limitations in their
approach to capturing the complexities of human ethics and contributing to AI
safety. We highlight tensions and contradictions inherent in the goals of RLxF.
In addition, we discuss ethically relevant issues that tend to be neglected in
discussions about alignment and RLxF, including the trade-offs between
user-friendliness and deception and between flexibility and interpretability,
as well as system safety. We conclude by urging researchers and practitioners
alike to critically
assess the sociotechnical ramifications of RLxF, advocating for a more nuanced
and reflective approach to its application in AI development.

AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations

Big models, exemplified by Large Language Models (LLMs), are typically
pre-trained on massive data and contain an enormous number of parameters. They
not only achieve significantly improved performance across diverse tasks but
also exhibit emergent capabilities absent in smaller models. However, the
growing intertwining of big models with everyday human life poses potential
risks and might cause serious social harm. Therefore, many efforts have been
made to align LLMs with humans so that they better follow user instructions and
satisfy human preferences. Nevertheless, "what to align with" has not been
fully discussed, and inappropriate alignment goals might even backfire. In this
paper, we conduct a comprehensive survey of different alignment goals in
existing work and trace their evolution paths to help identify the most
essential goal. In particular, we examine related work from two
perspectives: the definition of alignment goals and alignment evaluation. Our
analysis covers three distinct levels of alignment goals and reveals a shift
in goals from fundamental abilities to value orientation, indicating the
potential of intrinsic human values as the alignment goal for enhanced LLMs.
Based on these results, we further discuss the challenges of achieving such
intrinsic value alignment and provide a collection of available resources for
future research on the alignment of big models.

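By comprehensively surveying the alignment goals in existing work and tracing
their evolution, this paper reveals a shift in goals from fundamental abilities
to value orientation, suggesting that intrinsic human values may be the key
alignment goal for enhanced large language models. It further discusses the
challenges of achieving such intrinsic value alignment and provides a
collection of available resources to support future research on the alignment
of big models.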