Can the latent spaces of modern generative neural rendering models serve as representations for 3D-aware discriminative visual understanding tasks? We use retrieval as a proxy for measuring the metric learning properties of the latent spaces of Shap-E, including whether they capture view-independence and whether scene representations can be aggregated from the representations of individual image views. We find that Shap-E representations outperform those of a classical EfficientNet baseline zero-shot, and remain competitive when both methods are trained using a contrastive loss. These findings give a preliminary indication that 3D-based rendering and generative models can yield useful representations for discriminative tasks in our innately 3D-native world. Our code is available at \url{https://github.com/michaelwilliamtang/golden-retriever}.
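
To make the retrieval proxy concrete, the following is a minimal sketch of how such an evaluation could be set up. It is not taken from the released code: the mean-pooling aggregation, the cosine-similarity ranking, the recall@1 metric, and all function names (`aggregate_views`, `recall_at_1`, `demo`) are illustrative assumptions, and synthetic vectors stand in for Shap-E or EfficientNet embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def aggregate_views(view_embeddings):
    """Pool per-view embeddings of shape (n_views, dim) into one scene vector.

    Mean pooling is an assumption made for this sketch, not necessarily
    the aggregation used in the paper.
    """
    return l2_normalize(l2_normalize(view_embeddings).mean(axis=0))


def recall_at_1(queries, gallery, query_labels, gallery_labels):
    """Fraction of queries whose nearest gallery item (cosine) shares their label."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T
    nearest = sims.argmax(axis=1)
    return float(np.mean(gallery_labels[nearest] == query_labels))


def demo(n_scenes=3, n_views=4, dim=16, noise=0.05, seed=0):
    """Synthetic check: each view is its scene prototype plus small noise."""
    rng = np.random.default_rng(seed)
    prototypes = np.eye(n_scenes, dim)  # hypothetical stand-in for real embeddings
    labels = np.arange(n_scenes)
    # Gallery: one aggregated vector per scene, built from several noisy views.
    gallery = np.stack([
        aggregate_views(p + noise * rng.standard_normal((n_views, dim)))
        for p in prototypes
    ])
    # Queries: a single held-out noisy view per scene.
    queries = prototypes + noise * rng.standard_normal((n_scenes, dim))
    return recall_at_1(queries, gallery, labels, labels)


if __name__ == "__main__":
    print("recall@1 on synthetic data:", demo())
```

In this framing, a view-independent representation is one where a single query view still retrieves the correct aggregated scene, which is what the recall@1 score measures here.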

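This study evaluates whether the latent spaces of modern generative neural rendering models can serve as representations for 3D-aware discriminative visual understanding tasks. Using retrieval as a proxy for metric learning properties, it finds that Shap-E representations outperform classical EfficientNet baseline representations zero-shot and remain competitive when both methods are trained with a contrastive loss.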

Renderers are Good Zero-Shot Representation Learners: Exploring Diffusion Latents for Metric Learning