Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards using reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which departs from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks in RLBench.
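
The reward described in the abstract, a language-image alignment score combined with a novelty-seeking bonus, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration rather than the paper's implementation: the random vectors stand in for frozen VLM embeddings, the k-nearest-neighbour distance is a generic substitute for whatever novelty-seeking reward is actually used, and the names `cosine_similarity`, `knn_novelty_bonus`, `pretraining_rewards`, and the mixing weight `alpha` are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Alignment score between one image embedding and one text embedding."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_novelty_bonus(embeddings, index, k=3):
    """Stand-in novelty bonus (assumption): Euclidean distance from
    embeddings[index] to its k-th nearest neighbour in the batch."""
    dists = sorted(
        math.dist(embeddings[index], other)
        for j, other in enumerate(embeddings)
        if j != index
    )
    return dists[k - 1]


def pretraining_rewards(image_embs, text_emb, alpha=0.5, k=3):
    """Mix the language-alignment reward with the novelty bonus.
    alpha is a hypothetical weighting, not a value from the paper."""
    rewards = []
    for i, img in enumerate(image_embs):
        r_lang = cosine_similarity(img, text_emb)
        r_novel = knn_novelty_bonus(image_embs, i, k)
        rewards.append(alpha * r_lang + (1 - alpha) * r_novel)
    return rewards


# Placeholder data: random vectors in place of frozen VLM outputs.
random.seed(0)
dim = 16
image_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
instructions = ["pick up the cup", "open the drawer", "push the button"]
text_embs = {s: [random.gauss(0, 1) for _ in range(dim)] for s in instructions}

# Sample one instruction for this batch and score every observation against it.
instruction = random.choice(instructions)
rewards = pretraining_rewards(image_embs, text_embs[instruction])
```

In an actual pretraining loop, the sampled instruction would also condition the policy, and `rewards` would be passed to whatever RL algorithm is in use. The sketch only shows how the two reward terms might be combined for a batch of observations.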

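Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity. This paper proposes using LRFs as a pretraining signal for RL, namely $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a pretraining utility for RL rather than as a downstream task reward. By computing the contrastive alignment between a large collection of language instructions and the image observations in the agent's environment, LAMP uses a frozen, pretrained VLM to generate noisy but shaped exploration rewards. LAMP optimizes these rewards together with novelty-seeking exploration rewards from reinforcement learning to obtain a language-conditioned, pretrained policy. Our VLM pretraining approach differs from previous uses of LRFs and can warm-start sample-efficient learning on robot manipulation tasks in RLBench.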

Language Reward Modulation for Pretraining Reinforcement Learning