An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a `visual prompt', which is passed to the LLM along with the textual prompt. While this approach has enabled impressive performance on many coarse-grained tasks such as image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use \textit{diagnostic classifiers} to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.

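Using diagnostic classifiers to measure the spatial information in the visual prompt produced by the resampler, we find that this information is largely absent from the output of a frozen resampler when only the classifiers are trained, but that performance improves significantly when the resampler and classifier are trained jointly. This indicates that the compression achieved by the resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.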
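
To make the probing setup concrete, the following is a minimal sketch of a diagnostic classifier trained on fixed features. It is an illustration under stated assumptions, not the implementation used in this work: the linear softmax probe, the four-way label set, and the synthetic `informative` and `uninformative` feature matrices are all hypothetical stand-ins for real resampler outputs. The point it shows is only the logic of the frozen setting: the probe's weights are updated while the features are not, so held-out accuracy reflects what the features already encode.

```python
import numpy as np

def train_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear softmax probe; the input features are never updated."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    """Fraction of examples whose highest-scoring class matches the label."""
    return float(((features @ W + b).argmax(axis=1) == labels).mean())

rng = np.random.default_rng(0)
num_classes = 4  # hypothetical spatial classes, chosen only for illustration
labels = rng.integers(0, num_classes, size=400)

# Synthetic stand-ins (assumptions): one feature set carries the label, one does not.
informative = np.eye(num_classes)[labels] + 0.1 * rng.normal(size=(400, num_classes))
uninformative = rng.normal(size=(400, num_classes))

train, test = slice(0, 300), slice(300, 400)
W, b = train_probe(informative[train], labels[train], num_classes)
acc_informative = probe_accuracy(informative[test], labels[test], W, b)
W, b = train_probe(uninformative[train], labels[train], num_classes)
acc_uninformative = probe_accuracy(uninformative[test], labels[test], W, b)
```

In this toy setting, the probe recovers the labels from the informative features and stays near chance on the uninformative ones. The joint setting described above differs in that gradients would also flow into the module producing the features, which this sketch does not model.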

Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers