Recently, considerable attention has been paid to leveraging large language models (LLMs) to enhance decision-making. However, aligning the natural-language instructions that LLMs generate with the vectorized operations required for execution is a significant challenge and often requires task-specific details. To avoid this dependence on task-specific detail, and inspired by preference-based policy learning, we investigate using multimodal LLMs to provide automated preference feedback from image inputs alone to guide decision-making. In this study, we train a multimodal LLM, termed CriticGPT, that can understand trajectory videos of robot manipulation tasks and serve as a critic that offers analysis and preference feedback. We then validate the preference labels generated by CriticGPT from a reward modeling perspective. Experimental evaluation of preference accuracy shows that CriticGPT generalizes effectively to new tasks. Furthermore, performance on Meta-World tasks shows that the reward model trained on CriticGPT's labels efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
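
To make the reward modeling perspective concrete, the following is a minimal sketch of how pairwise preference labels are commonly turned into a training signal for a reward model, using the Bradley-Terry formulation that is standard in preference-based reinforcement learning. The abstract does not state which loss the authors use, so the formulation, the function names, and the label convention (1.0 when segment A is preferred, 0.0 when segment B is preferred, 0.5 for a tie) are all assumptions made for illustration.

```python
import math


def preference_probability(return_a: float, return_b: float) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B.

    Each argument is the sum of predicted rewards over one trajectory segment.
    """
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(rewards_a, rewards_b, label: float) -> float:
    """Cross-entropy loss for one labeled pair of trajectory segments.

    label is assumed to be 1.0 if A is preferred, 0.0 if B is preferred,
    and 0.5 if the two segments are judged equally good.
    """
    p_a = preference_probability(sum(rewards_a), sum(rewards_b))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + eps) + (1.0 - label) * math.log(1.0 - p_a + eps))


# Hypothetical per-step rewards predicted for two short segments.
segment_a = [1.0, 1.0]
segment_b = [0.0, 0.0]
print(preference_loss(segment_a, segment_b, label=1.0))  # small: prediction agrees with label
print(preference_loss(segment_a, segment_b, label=0.0))  # large: prediction contradicts label
```

Under these assumptions, minimizing the loss over many labeled pairs pushes the reward model to assign higher total reward to the segments that the critic prefers, and the learned reward can then be used to train a policy.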

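This work studies the use of a multimodal language model to provide automated preference feedback from image inputs to guide decision-making. It presents CriticGPT, a multimodal language model that understands trajectory videos of robot manipulation tasks and provides analysis and preference feedback, and it validates the preference labels the model generates. Experimental evaluation shows that the model generalizes effectively to new tasks, and performance on Meta-World tasks shows that CriticGPT's reward model effectively guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.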

Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models