Enhancing reasoning in Large Multimodal Models (LMMs) is challenging because visual perception and logical reasoning interact in complex ways, particularly in compact 3B-parameter architectures, whose limited capacity constrains both reasoning and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose \textbf{LMM-R1}, a two-stage framework that adapts rule-based RL to multimodal reasoning through \textbf{Foundational Reasoning Enhancement (FRE)} followed by \textbf{Multimodal Generalization Training (MGT)}. The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL; the MGT stage then generalizes these capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves average improvements of 4.83\% and 4.5\% over baselines on multimodal and text-only benchmarks, respectively, with a 3.63\% gain on complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that avoids the need for costly high-quality multimodal training data.

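This work addresses the challenges that large multimodal models face in reasoning, in particular the insufficient reasoning capacity and modality alignment problems caused by the constraints of 3B-parameter architectures. The proposed LMM-R1 framework improves reasoning through two stages, Foundational Reasoning Enhancement (FRE) and Multimodal Generalization Training (MGT). Experiments show that, relative to baselines, LMM-R1 improves average performance on multimodal and text-only benchmarks by 4.83% and 4.5%, respectively, indicating that text-based reasoning enhancement effectively promotes multimodal generalization and offers a data-efficient training approach.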

Improving the reasoning ability of 3B models through two-stage rule-based reinforcement learning