Pre-trained Vision-Language Models (VLMs) are able to understand visual
concepts, describe and decompose complex tasks into sub-tasks, and provide
feedback on task completion. In this paper, we aim to leverage these
capabilities to support the training of reinforcement learning (RL) agents. In
principle, VLMs are well suited for this purpose, as they can naturally analyze
image-based observations and provide feedback (reward) on learning progress.
However, inference in VLMs is computationally expensive, so querying them
frequently to compute rewards would significantly slow down the training of an
RL agent. To address this challenge, we propose a framework named Code as
Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through
code generation, thereby significantly reducing the computational burden of
querying the VLM directly. We show that the dense rewards generated through our
approach are very accurate across a diverse set of discrete and continuous
environments, and can be more effective in training RL policies than the
original sparse environment rewards.

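Summary: This work leverages pre-trained vision-language models (VLMs) to
support the training of reinforcement learning agents. It proposes a framework
named VLM-CaR, which produces dense reward functions from VLMs through code
generation, greatly reducing the computational burden of querying the VLM
directly. The results show that the dense rewards generated by this approach
are very accurate across a variety of discrete and continuous environments and
can train RL policies more effectively than the original sparse environment
rewards.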
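
To make the idea of a reward expressed as code concrete, the following is a minimal sketch, not the paper's actual generated programs. It assumes a toy image observation in which the agent is a red pixel and a key is a green pixel; the checker names (`key_picked_up`, `agent_at_door`), the colors, and the door position are all hypothetical. The point it illustrates is that once sub-task checks exist as ordinary functions, a dense reward can be computed at every step without calling the VLM.

```python
import numpy as np

# Hypothetical sub-task checkers of the kind a VLM might emit as code.
# Each takes an RGB image observation of shape (H, W, 3).
RED = np.array([255, 0, 0])    # assumed agent color
GREEN = np.array([0, 255, 0])  # assumed key color


def find_color(obs, color):
    """Return (row, col) of the first pixel matching color, or None."""
    matches = np.argwhere(np.all(obs == color, axis=-1))
    return tuple(int(v) for v in matches[0]) if len(matches) else None


def key_picked_up(obs):
    # Assumption: the key disappears from the image once collected.
    return find_color(obs, GREEN) is None


def agent_at_door(obs, door_pos=(0, 3)):
    # Assumption: the door sits at a fixed, known pixel location.
    return find_color(obs, RED) == door_pos


# Ordered sub-tasks: later ones only count once earlier ones hold.
SUBTASKS = [key_picked_up, agent_at_door]


def dense_reward(obs):
    """Fraction of sub-tasks completed in order; no VLM call is made."""
    done = 0
    for check in SUBTASKS:
        if not check(obs):
            break
        done += 1
    return done / len(SUBTASKS)
```

In a training loop, `dense_reward(obs)` would be evaluated on each observation in place of (or alongside) the sparse environment reward; the VLM would only be needed once, up front, to produce the checker functions.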