Reinforcement Learning with Human Feedback (RLHF) has received significant attention because it can solve tasks without costly manual reward design by aligning with human preferences. Practical RLHF must account for diverse types of human feedback and for different learning methods across environments. However, quantifying progress in RLHF with diverse feedback is challenging because standardized annotation platforms and widely used unified benchmarks are lacking. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow starting from real human feedback, fostering progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF provides a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline for crowdsourced annotation, resulting in large-scale annotated datasets comprising more than 15 million steps across 30+ popular tasks. In extensive experiments, results obtained with the collected datasets are competitive with those obtained with well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We aim to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at https://uni-rlhf.github.io/.

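With the Uni-RLHF system, we provide a complete workflow that starts from real human feedback and supports work on practical problems, comprising a universal multi-feedback annotation platform, large-scale crowdsourced feedback datasets, and modular offline RLHF baseline implementations. Extensive experiments show that, compared with well-designed manual rewards, the collected datasets yield competitive performance across multiple tasks; we also evaluate various design choices and identify potential areas of improvement. We hope to build valuable open-source platforms, datasets, and baselines to facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback.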

Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback