Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. By employing this evaluation framework on current Transformer-based multimodal models for robotic manipulation tasks, we uncover limitations and suggest future advancements should focus on architectural and training innovations that better integrate multimodal inputs, enhancing a model's generalisation prowess by prioritising sensitivity to input content over incidental correlations.

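By introducing a comprehensive evaluation framework, this study systematically examines the role of instructions and inputs in the generalisation abilities of multimodal models, considering architectural design, input perturbations across the language and vision modalities, and increased task complexity. It reveals the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. Applying the framework to current Transformer-based multimodal models for robotic manipulation tasks uncovers limitations, and the study suggests that future work should focus on architectural and training innovations that better integrate multimodal inputs, improving generalisation by prioritising sensitivity to input content over incidental correlations.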

Exploring the role of instruction type and task difficulty in robotic manipulation tasks