We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, with the latter capturing both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a 'where' pathway), while value vectors refine a semantic category representation (a 'what' pathway).
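
To make the formulation concrete, the following is a minimal sketch under stated assumptions, not the objective or implementation used in this work. It assumes cosine similarity as the per-layer affinity, averages the symmetrically normalized affinities across layers, and maximizes the resulting trace objective by gradient ascent with QR re-orthonormalization. The function names (`layer_affinity`, `normalize`, `joint_spectral_embedding`) and all hyperparameters are hypothetical.

```python
import numpy as np

def layer_affinity(features):
    """Assumed affinity: non-negative cosine similarity between the n
    feature vectors (rows) of a single layer."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return np.clip(f @ f.T, 0.0, None)

def normalize(a):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, as in classic
    spectral clustering."""
    inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1) + 1e-8)
    return a * inv_sqrt[:, None] * inv_sqrt[None, :]

def joint_spectral_embedding(layer_features, k=2, steps=500, lr=0.1, seed=0):
    """Maximize the mean over layers of tr(X^T M_l X) subject to
    X^T X = I, where M_l is the normalized affinity of layer l.

    layer_features: list of (n, d_l) arrays, one per layer, all
    describing the same n locations. Returns an (n, k) embedding.
    """
    rng = np.random.default_rng(seed)
    mats = [normalize(layer_affinity(f)) for f in layer_features]
    m = sum(mats) / len(mats)  # the objective is linear in the M_l
    n = m.shape[0]
    # Random orthonormal starting point.
    x, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(steps):
        grad = 2.0 * m @ x                  # gradient of tr(X^T M X), M symmetric
        x, _ = np.linalg.qr(x + lr * grad)  # ascent step, then re-orthonormalize
    return x
```

Given a list of per-layer feature arrays, each with one row per spatial location, `joint_spectral_embedding(layer_features, k)` returns `k` orthonormal columns that play the role of the eigenvectors discussed above. Clustering its rows (for example with k-means) would give a segmentation under these assumptions.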

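We present a method for extracting spatial layout and semantic segmentation from the behavior of large pre-trained vision models. By analyzing the grouping information contained in a neural network's activations, the method uses features from all layers to analyze the network's activation state holistically, without guessing which part of the model contains the relevant information. We formulate an optimization objective over a set of affinity matrices, each obtained by comparing features within a layer, and solve it by gradient descent. Analyzing a pre-trained generative transformer reveals the computational strategy such models learn: equating affinity with key-query similarity yields eigenvectors that encode scene spatial layout, whereas defining affinity by value-vector similarity yields eigenvectors that encode object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a 'where' pathway), while value vectors refine a semantic category representation (a 'what' pathway).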

Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations