Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates the problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs. Instead, it leads to a phenomenon we term thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions that result in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. The resulting GTR (Guided Thought Reinforcement) framework is simple and scalable, and it trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates than state-of-the-art (SoTA) models while using a notably smaller model.
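To make "rewards based solely on action outcomes" concrete, here is a minimal sketch of a verifiable outcome reward for the 24 points game. It is an illustration only, not the environment or reward used in the paper: the function name `outcome_reward`, the expression-string interface, and the reward values (+1, 0, -1) are all assumptions. Note that the reward inspects only the final answer and never the reasoning that produced it, which is the setting in which the abstract reports thought collapse.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

# Binary operators permitted in a 24 points expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording every integer it uses."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def outcome_reward(expression, cards, target=24):
    """Hypothetical outcome-only reward (the values are assumptions).

    -1.0: invalid action (unparsable, division by zero, or wrong cards used)
     0.0: valid expression that does not reach the target
    +1.0: valid expression that equals the target
    """
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return -1.0
    # Each card must be used exactly once.
    if Counter(used) != Counter(cards):
        return -1.0
    return 1.0 if value == target else 0.0
```

For example, `outcome_reward("(1 + 2 + 3) * 4", [1, 2, 3, 4])` yields the success reward, while an expression that ignores the dealt cards is penalized as an invalid action. Exact `Fraction` arithmetic is used so that solutions relying on intermediate fractions, such as `8 / (3 - 8 / 3)`, are not rejected because of floating-point rounding.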

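This study addresses the poor performance of RL-trained vision-language model agents on goal-directed reasoning in visual environments. It proposes GTR, a framework with an automated correction mechanism that evaluates and refines the agent's reasoning at each RL step, which effectively prevents thought collapse and significantly improves the model's task success rate and generalization. Experiments show that, compared with state-of-the-art models, GTR achieves 3-5 times higher task success rates across a range of visual environments.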

GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training