As multimodal foundational models start being deployed experimentally in self-driving cars, a reasonable question to ask is how similarly to humans these systems respond in certain driving situations, especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the most aggressive drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation to multimodal Visual Question Answering (VQA), comparing humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of question asked of each type of system (humans vs. VLMs), highlighting a gap in their alignment.
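To make the RSA comparison concrete, here is a minimal sketch of the general technique, not the paper's actual pipeline. It assumes each system's answer to each question has already been turned into a numeric vector; the names `human`, `vlm`, `rdm_upper`, and `rsa_score` are hypothetical, the data is a toy example, and Pearson correlation is used at both stages for simplicity (rank correlation is a common alternative for the second stage).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rdm_upper(vectors):
    """Upper triangle of a representational dissimilarity matrix.

    Dissimilarity between two answers is taken as 1 - Pearson r
    (an assumption; other distance measures are possible).
    """
    return [1.0 - pearson(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def rsa_score(responses_a, responses_b):
    """Correlate two systems' dissimilarity structures over the same questions."""
    return pearson(rdm_upper(responses_a), rdm_upper(responses_b))


# Hypothetical toy data: one row per question, each row an answer vector.
human = [[1.0, 0.0, 2.0], [0.9, 0.1, 2.1], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0]]
vlm = [[2.0, 0.0, 4.0], [1.8, 0.2, 4.2], [0.0, 6.0, 2.0], [4.0, 2.0, 0.0]]

print(rsa_score(human, vlm))
```

Because the toy `vlm` rows are scaled copies of the `human` rows, the two dissimilarity structures coincide and the score is essentially 1. With real answers, a lower score on a given question type would indicate that the two systems organize their answers differently.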

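This work examines how multimodal foundational models respond in autonomous driving, particularly in out-of-distribution situations, filling a research gap in this area. We propose the Robusto-1 dataset, which uses dashcam video from Peru for the comparison. Using multimodal visual question answering, we find that the agreement and divergence between humans and visual language models at the cognitive level depend significantly on the type of question asked, which reveals a gap in their cognitive alignment.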

Robusto-1 Dataset: Comparing Humans and Visual Language Models on Real Out-of-Distribution Autonomous Driving in Peru