Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization (DPO) algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets, as well as better generalization to direct answer prediction. This work emphasizes the importance of incorporating detailed rationales in training and of leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
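
To make the second stage concrete, the following is a minimal sketch of how preference pairs might be built from sampled reasoning chains and scored with the standard DPO objective. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`extract_answer`, `build_preference_pairs`, `dpo_loss`), the `Answer:` marker used to locate the final prediction, the exact-match comparison against the annotated short answer, and the value `beta=0.1` are all hypothetical choices made for this example.

```python
import math
from itertools import product


def extract_answer(chain: str) -> str:
    """Hypothetical parser: read the prediction after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = chain.rfind(marker)
    tail = chain[idx + len(marker):] if idx != -1 else chain
    return tail.strip().rstrip(".").lower()


def build_preference_pairs(chains, gold_answer):
    """Pair every correct chain (chosen) with every incorrect chain (rejected).

    Correctness is decided by exact match with the annotated short answer,
    which is an assumption of this sketch.
    """
    gold = gold_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    return [{"chosen": pos, "rejected": neg} for pos, neg in product(correct, incorrect)]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin)).

    Inputs are sequence log-probabilities under the policy and a frozen
    reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: three sampled chains for one question whose short answer is "3".
chains = [
    "Three bars rise above 10. Answer: 3",
    "Two bars exceed 10. Answer: 2",
    "Counting the tall bars gives 3. Answer: 3.",
]
pairs = build_preference_pairs(chains, "3")
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

In an actual training setup, each log-probability would be the sum of token log-probabilities of the reasoning chain given the image and question, and the loss would be averaged over a batch of pairs. A question whose sampled chains are all correct or all incorrect yields no pairs under this scheme.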

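This paper addresses the lack of sufficiently detailed training data for chain-of-thought (CoT) reasoning in vision language models (VLMs). By distilling rationales from the GPT-4o model to enrich the training data and applying reinforcement learning to refine reasoning quality, it significantly improves VLM performance on benchmark datasets and generalization to direct answer prediction. The study highlights the importance of incorporating detailed rationales during training and of using reinforcement learning to strengthen VLM reasoning.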

Improving Chain-of-Thought Reasoning in Vision Language Models