Reward-free data is abundant and contains rich prior knowledge of human behaviors, but it is not well exploited by offline reinforcement learning (RL) algorithms. In this paper, we propose UBER, an unsupervised approach to extract useful behaviors from offline reward-free datasets via diversified rewards. UBER assigns different pseudo-rewards sampled from a given prior distribution to different agents to extract a diverse set of behaviors, and reuses them as candidate policies to facilitate the learning of new tasks. Perhaps surprisingly, we show that rewards generated from random neural networks are sufficient to extract diverse and useful behaviors, some even close to expert ones. We provide both empirical and theoretical evidence to justify the use of random priors for the reward function. Experiments on multiple benchmarks showcase UBER's ability to learn effective and diverse behavior sets that enhance sample efficiency for online RL, outperforming existing baselines. By reducing reliance on human supervision, UBER broadens the applicability of RL to real-world scenarios with abundant reward-free data.
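
The pseudo-reward step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the network size, the initialization, the dataset keys (`observations`, `actions`, `rewards`), and the helper names `make_random_reward_fn` and `relabel` are all assumptions made for illustration. Each agent gets its own fixed, randomly initialized network, which labels the same reward-free transitions with a different reward signal.

```python
import numpy as np


def make_random_reward_fn(obs_dim, act_dim, hidden=64, seed=0):
    """Build a fixed, randomly initialized MLP r(s, a) -> scalar.

    Hypothetical helper: the architecture and scaling are illustrative
    choices, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    in_dim = obs_dim + act_dim
    w1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))

    def reward_fn(obs, act):
        # Concatenate state and action, then apply one hidden tanh layer.
        x = np.concatenate([obs, act], axis=-1)
        h = np.tanh(x @ w1 + b1)
        return (h @ w2).squeeze(-1)

    return reward_fn


def relabel(dataset, n_agents):
    """Return one pseudo-labeled copy of the dataset per agent."""
    obs, act = dataset["observations"], dataset["actions"]
    labeled = []
    for k in range(n_agents):
        # A different seed gives each agent a different random reward.
        r_k = make_random_reward_fn(obs.shape[1], act.shape[1], seed=k)
        labeled.append({**dataset, "rewards": r_k(obs, act)})
    return labeled


# Toy reward-free dataset: 5 transitions, 3-dim states, 2-dim actions.
rng = np.random.default_rng(123)
toy = {
    "observations": rng.normal(size=(5, 3)),
    "actions": rng.normal(size=(5, 2)),
}
labeled = relabel(toy, n_agents=4)
```

In this sketch, each entry of `labeled` would then be handed to an offline RL algorithm of one's choice to train one behavior policy per pseudo-reward; that training step is deliberately left out.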

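This work proposes UBER, an offline reinforcement learning algorithm for reward-free data. UBER uses diversified rewards to extract useful behaviors from reward-free datasets and reuses them as candidate policies to facilitate the learning of new tasks. Empirical and theoretical evidence shows that rewards generated from random neural networks are sufficient to extract diverse and useful behaviors, some of them close to expert level. Experiments show that UBER learns effective and diverse behavior sets, improves the sample efficiency of online RL, and outperforms existing baselines. By reducing reliance on human supervision, UBER broadens the applicability of RL to real-world scenarios with abundant reward-free data.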

Unsupervised Behavior Extraction via Random Intent Priors