This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.

本文批评性地评估了通过强化学习从反馈中对齐人工智能系统，特别是大规模语言模型，与人的价值观和意图的尝试，包括人的反馈和人工智能的反馈。具体来说，我们展示了广泛追求的诚实、无害和有帮助的对齐目标的不足。通过多学科社会技术批判，我们考察了RLxF技术的理论基础和实践实现，揭示了其在捕捉人类伦理复杂性和促进人工智能安全方面的重要局限性。我们强调了RLxF目标中固有的张力和矛盾。此外，我们讨论了在关于对齐和RLxF的讨论中往往被忽视的道德相关问题，其中包括用户友好与欺骗、灵活性与可解释性、系统安全之间的权衡。我们最后敦促研究人员和从业者在评估RLxF的社会技术后果时进行批判性评估，倡导在人工智能开发中采用更细致、反思的方法。

通过人类反馈进行强化学习的AI对齐? 矛盾和限制