Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

February 2023

Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

Henrik Voigt, Jan Hombeck, Monique Meuschke, Kai Lawonn, Sina Zarrieß

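TL;DR: This paper studies how well the CLIP model describes and recognizes object viewpoints in 3D environments, and examines fine-tuning with hard negative sampling and random contrasting when only a small amount of training data is available.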

Abstract

Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in →