Large vision-language models (VLMs) fine-tuned on specialized visual
instruction-following data have exhibited impressive language reasoning
capabilities across various scenarios. However, this fine-tuning paradigm may
not be able to efficiently learn optimal decision-making agents in multi-step
goal-directed tasks from interactive environments. To address this challenge,
we propose an algorithmic framework that fine-tunes VLMs with reinforcement
learning (RL). Specifically, our framework provides a task description and then
prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM
to efficiently explore intermediate reasoning steps that lead to the final
text-based action. Next, the open-ended text output is parsed into an
executable action to interact with the environment to obtain goal-directed task
rewards. Finally, our framework uses these task rewards to fine-tune the entire
VLM with RL. Empirically, we demonstrate that our proposed framework enhances
the decision-making capabilities of VLM agents across various tasks, enabling
7B models to outperform commercial models such as GPT-4V or Gemini.
Furthermore, we find that CoT reasoning is a crucial component for performance
improvement, as removing the CoT reasoning results in a significant decrease in
the overall performance of our method.
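
The parsing step described above, where open-ended text output becomes an executable action, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the JSON output format with an `"action"` key, the function name `parse_action`, the `LEGAL_ACTIONS` list, and the fallback to a random legal action when parsing fails are all assumptions made for this example.

```python
import json
import random
import re

# Hypothetical action set for illustration; real tasks define their own.
LEGAL_ACTIONS = ["left", "right", "up", "down"]


def parse_action(text, legal_actions, rng=random):
    """Map open-ended model output to an executable action.

    Assumes (hypothetically) that the model was prompted to end its
    chain-of-thought with a JSON object such as
    {"thoughts": "...", "action": "left"}.
    Returns a pair (action, parsed_ok).
    """
    # Try every {...} span without nested braces, last one first.
    for candidate in reversed(re.findall(r"\{[^{}]*\}", text)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        action = str(obj.get("action", "")).strip().lower()
        if action in legal_actions:
            return action, True
    # Assumed fallback: pick a random legal action so the episode continues.
    return rng.choice(legal_actions), False


output = 'The goal is to my left. {"thoughts": "move toward goal", "action": "Left"}'
print(parse_action(output, LEGAL_ACTIONS))  # ('left', True)
```

The returned action would then be passed to the environment, and the resulting task reward used as the RL training signal; the `parsed_ok` flag is one hypothetical way to track how often the model produces well-formed output.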

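Summary: The paper fine-tunes vision-language models with reinforcement learning, proposes an algorithmic framework to enhance their decision-making capabilities, verifies the importance of chain-of-thought reasoning, and demonstrates performance that surpasses commercial models across various tasks.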

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

While reinforcement learning algorithms provide automated acquisition of
optimal policies, practical application of such methods requires a number of
design decisions, such as manually designing reward functions that not only
define the task, but also provide sufficient shaping to accomplish it. In this
paper, we view reinforcement learning as inferring policies that achieve
desired outcomes, rather than as a problem of maximizing rewards. To solve this
inference problem, we establish a novel variational inference formulation that
allows us to derive a well-shaped reward function which can be learned directly
from environment interactions. From the corresponding variational objective, we
also derive a new probabilistic Bellman backup operator and use it to develop
an off-policy algorithm to solve goal-directed tasks. We empirically
demonstrate that this method eliminates the need to hand-craft reward functions
for a suite of diverse manipulation and locomotion tasks and leads to effective
goal-directed behaviors.

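Summary: The paper proposes a new variational inference formulation that learns a well-shaped reward function directly from environment interactions and, using a new probabilistic Bellman backup operator, develops an off-policy algorithm for goal-directed tasks. The method eliminates the need for hand-crafted reward functions and produces effective goal-directed behaviors on a variety of manipulation and locomotion tasks.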

Outcome-Driven Reinforcement Learning via Variational Inference

Causal reasoning has been an indispensable capability for humans and other
intelligent animals to interact with the physical world. In this work, we
propose to endow an artificial agent with the capability of causal reasoning
for completing goal-directed tasks. We develop learning-based approaches to
inducing causal knowledge in the form of directed acyclic graphs, which can be
used to contextualize a learned goal-conditional policy to perform tasks in
novel environments with latent causal structures. We leverage attention
mechanisms in our causal induction model and goal-conditional policy, enabling
us to incrementally generate the causal graph from the agent's visual
observations and to selectively use the induced graph for determining actions.
Our experiments show that our method effectively generalizes towards completing
new tasks in novel environments with previously unseen causal structures.
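
One way to picture "selectively using the induced graph" is attention over the graph's edges, conditioned on the goal. The sketch below uses generic scaled dot-product attention as a stand-in; it is an assumption for illustration, not the architecture from the paper, and the names `attend_over_edges`, `goal_query`, `edge_keys`, and `edge_values` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_edges(goal_query, edge_keys, edge_values):
    """Weight each edge of an induced causal graph by relevance to the goal.

    goal_query:  goal embedding (list of floats), assumed given.
    edge_keys:   one key vector per edge, same length as goal_query.
    edge_values: one value vector per edge (e.g. an encoding of cause -> effect).
    Returns (weights, context), where context is the weighted sum of values.
    """
    scale = math.sqrt(len(goal_query))
    scores = [
        sum(q * k for q, k in zip(goal_query, key)) / scale
        for key in edge_keys
    ]
    weights = softmax(scores)
    dim = len(edge_values[0])
    context = [
        sum(w * value[i] for w, value in zip(weights, edge_values))
        for i in range(dim)
    ]
    return weights, context


# Two edges; the goal embedding matches the first edge's key more closely.
weights, context = attend_over_edges(
    goal_query=[1.0, 0.0],
    edge_keys=[[4.0, 0.0], [0.0, 4.0]],
    edge_values=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
print(weights, context)
```

In a setup like this, the `context` vector would be fed to the goal-conditioned policy alongside the visual observation, so that only the edges relevant to the current goal influence the chosen action.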

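Summary: The paper proposes a method that induces causal knowledge in the form of directed acyclic graphs to help an artificial agent complete goal-directed tasks, and shows experimentally that the method generalizes effectively to new tasks in novel environments with previously unseen causal structures.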