Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content. Two distinct challenges remain, however: high sensitivity to the choice of handcrafted class names that define the queries, and the difficulty of adapting to new, smaller datasets. To address these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names. We show that our solution can easily be integrated into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labelling errors.
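The core idea, learning one word embedding per class while the rest of the model stays frozen, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the actual pipeline: the frozen text encoder is replaced by a fixed linear map (`W_text`), the prompt by a fixed `context` vector, the image encoder by synthetic clustered features, and the training objective by plain softmax cross-entropy on dot-product logits.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_classes, n_per_class = 8, 3, 20

# Hypothetical stand-in for a frozen text encoder: a fixed linear map applied
# to (prompt context + class-name word embedding). Never updated below.
W_text = rng.normal(size=(dim, dim)) / np.sqrt(dim)
context = rng.normal(size=dim)

def encode_text(word_embeddings):
    """Map per-class word embeddings (C, dim) to text features (C, dim)."""
    return (context + word_embeddings) @ W_text.T

# Hypothetical stand-in for frozen image features: noisy class clusters.
prototypes = rng.normal(size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), n_per_class)
image_feats = prototypes[labels] + 0.1 * rng.normal(size=(len(labels), dim))
rows = np.arange(len(labels))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(word_embeddings):
    """Cross-entropy of image-to-text logits and its gradient with respect
    to the class word embeddings only."""
    text_feats = encode_text(word_embeddings)   # (C, dim)
    logits = image_feats @ text_feats.T         # (N, C)
    probs = softmax(logits)
    loss = -np.log(probs[rows, labels]).mean()
    d_logits = probs.copy()
    d_logits[rows, labels] -= 1.0
    d_logits /= len(labels)
    d_text = d_logits.T @ image_feats           # gradient w.r.t. text features
    grad = d_text @ W_text                      # chain rule through the frozen map
    return loss, grad

# Start from arbitrary "class names" (random word embeddings) and refine them
# with gradient descent; W_text and context are left untouched.
word_embeddings = rng.normal(size=(n_classes, dim))
initial_loss, _ = loss_and_grad(word_embeddings)

learned = word_embeddings.copy()
for _ in range(500):
    _, grad = loss_and_grad(learned)
    learned -= 0.05 * grad

final_loss, _ = loss_and_grad(learned)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

Because only the per-class embeddings receive gradient updates, a class that is not seen during training can still be queried through the unchanged encoder with its handcrafted name, which is the property the abstract refers to as retaining zero-shot capabilities.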

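We use available data to learn, for each class, an optimal word embedding as a function of the visual content, thereby addressing the high sensitivity of zero-shot recognition to handcrafted class names and the difficulty of adapting to new, smaller datasets. We show that the solution can easily be integrated into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labelling errors.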

Learning to Name Classes for Vision and Language Models