This paper critically evaluates the attempts to align Artificial Intelligence
(AI) systems, especially Large Language Models (LLMs), with human values and
intentions through Reinforcement Learning from Feedback (RLxF) methods,
involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we
show the shortcomings of the broadly pursued alignment goals of honesty,
harmlessness, and helpfulness. Through a multidisciplinary sociotechnical
critique, we examine both the theoretical underpinnings and practical
implementations of RLxF techniques, revealing significant limitations in their
approach to capturing the complexities of human ethics and contributing to AI
safety. We highlight tensions and contradictions inherent in the goals of RLxF.
In addition, we discuss ethically relevant issues that tend to be neglected in
discussions about alignment and RLxF, including the trade-offs between
user-friendliness and deception and between flexibility and interpretability,
as well as system safety. We conclude by urging researchers and practitioners
alike to critically
assess the sociotechnical ramifications of RLxF, advocating for a more nuanced
and reflective approach to its application in AI development.
