Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we introduce divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating that the proposed approach achieves strong empirical performance while remaining simple.
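
The mass-covering versus mode-seeking distinction can be illustrated with a toy example that is not taken from the paper. The sketch below assumes the standard pairing of forward KL, KL(p || q), with mass-covering behavior and reverse KL, KL(q || p), with mode-seeking behavior; the abstract does not state which divergences the authors use. A bimodal target p on a discrete grid stands in for the demonstration distribution, and a single discretized Gaussian q stands in for the model. All names, grid sizes, and parameter ranges are illustrative assumptions.

```python
import math

def normal_pmf(xs, mu, sigma):
    """Gaussian density evaluated on the grid xs and renormalized to sum to 1."""
    weights = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """Discrete KL(a || b); terms with a_i == 0 contribute nothing."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Grid on [-6, 6] with step 0.1.
xs = [i / 10 for i in range(-60, 61)]

# Toy bimodal target (assumption): equal mixture of modes at -2 and +2.
left = normal_pmf(xs, -2.0, 0.5)
right = normal_pmf(xs, 2.0, 0.5)
target = [0.5 * l + 0.5 * r for l, r in zip(left, right)]

# Candidate single-Gaussian models: mu in [-3, 3], sigma in [0.3, 3.0].
candidates = [(m / 4, s / 10) for m in range(-12, 13) for s in range(3, 31)]

# Forward KL(p || q): q is penalized wherever p has mass that q misses.
forward = min(candidates, key=lambda c: kl(target, normal_pmf(xs, *c)))

# Reverse KL(q || p): q is penalized wherever it puts mass that p lacks.
reverse = min(candidates, key=lambda c: kl(normal_pmf(xs, *c), target))

print("forward KL fit (mu, sigma):", forward)
print("reverse KL fit (mu, sigma):", reverse)
```

In this toy setting the forward-KL fit spreads a wide Gaussian across both modes, while the reverse-KL fit concentrates on a single mode. Which of the two symmetric modes the reverse-KL fit picks depends on tie-breaking in the grid search.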

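Leveraging high-quality demonstration data, we propose a new method called AfD that addresses challenges such as noisy labels, high annotation costs, and privacy concerns. We formalize AfD within a sequential decision-making framework and introduce divergence minimization objectives to handle its unique challenge of missing reward signals. We also propose a computationally efficient algorithm that extrapolates over a tailored reward model. Experiments on the Harmless and Helpful tasks validate our key insights, showing strong empirical performance while maintaining simplicity.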

Inverse-RL Alignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment