Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short because their feedback arrives late and they are prone to local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision from a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize its action at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in three GUI navigation tasks, achieving a 3.4% improvement in single-step action accuracy for static environments, along with an approximately 33% increase in task success rate in one dynamic environment. With further integration of trajectory reflection and retry mechanisms, we demonstrate even greater improvements in task success.
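
As a rough illustration of step-level guidance, the sketch below picks one action per step by scoring several candidates with a reward function and keeping the highest-scoring one. This best-of-N selection is an assumption made for illustration; the abstract does not specify how the reward model's scores are used. The names `Action`, `select_action`, `toy_propose`, and `toy_reward` are hypothetical, and the toy functions are stand-ins for a real VLM policy and a trained process reward model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "click", "type", "scroll"
    target: str  # hypothetical UI element identifier


def select_action(
    observation: str,
    goal: str,
    propose: Callable[[str, str, int], Sequence[Action]],
    reward: Callable[[str, str, Action], float],
    num_candidates: int = 4,
) -> Action:
    """Return the candidate action with the highest step-level reward."""
    candidates = propose(observation, goal, num_candidates)
    if not candidates:
        raise ValueError("the policy proposed no candidate actions")
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda a: reward(observation, goal, a))


def toy_propose(observation: str, goal: str, n: int) -> Sequence[Action]:
    """Stand-in for sampling n candidate actions from a VLM policy."""
    pool = [
        Action("click", "cancel_button"),
        Action("click", "search_box"),
        Action("scroll", "page"),
        Action("type", "search_box"),
    ]
    return pool[:n]


def toy_reward(observation: str, goal: str, action: Action) -> float:
    """Stand-in for a process reward model (assumed heuristic, not learned)."""
    score = 0.0
    if action.target in goal:
        score += 1.0  # the action touches an element named in the goal
    if action.kind == "click":
        score += 0.5  # the toy goal starts with a click
    return score


best = select_action(
    observation="home screen",
    goal="click the search_box and enter a query",
    propose=toy_propose,
    reward=toy_reward,
)
print(best)
```

Because the score is computed before an action is executed, a poor candidate can be rejected at the step where it arises instead of being discovered only after the whole trajectory has finished.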

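This work addresses the shortcomings of existing visual language models in complex GUI interaction tasks by proposing a method that provides process supervision to VLM agents through a reward model at inference time. The method improves the action accuracy and task success rate of VLM agents in both static and dynamic environments: single-step action accuracy in static environments increases by 3.4%, and the task success rate in a dynamic environment increases by about 33%.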

Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation