Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (e.g., <orange-cat> = orange + cat)? We conduct a large-scale analysis of three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball of any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, the fine-tuning that targeted the original model is lost. We show that popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and that the resulting embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
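To make the idea of a perturbation confined to an $\epsilon$-ball around a prior embedding concrete, the following is a minimal sketch and not the authors' implementation. It assumes an L2 ball (the abstract does not state the norm) and stands in a toy quadratic loss for the real generation, detection, or classification objective. The names `project_to_ball`, `tune_within_ball`, and `toy_grad` are hypothetical.

```python
import math

def project_to_ball(candidate, prior, eps):
    """Project `candidate` onto the L2 ball of radius `eps` centered at `prior`.

    Assumption: the epsilon-ball is measured in the L2 norm.
    """
    delta = [c - p for c, p in zip(candidate, prior)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(candidate)
    scale = eps / norm
    return [p + d * scale for p, d in zip(prior, delta)]

def tune_within_ball(prior, grad_fn, eps, lr=0.1, steps=100):
    """Projected gradient descent that never leaves the ball around `prior`."""
    emb = list(prior)
    for _ in range(steps):
        grad = grad_fn(emb)
        emb = [e - lr * g for e, g in zip(emb, grad)]
        emb = project_to_ball(emb, prior, eps)
    return emb

# Toy stand-in for a concept-learning loss: squared distance to a target vector.
# A real setup would backpropagate through the frozen model instead.
target = [1.0, 0.0, 0.0]

def toy_grad(emb):
    return [2.0 * (e - t) for e, t in zip(emb, target)]

prior = [0.0, 0.0, 0.0]  # hypothetical embedding of an existing word
tuned = tune_within_ball(prior, toy_grad, eps=0.5)
```

In this toy setting the tuned vector moves toward the target but stops at the boundary of the ball, so it remains a small perturbation of the prior embedding.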


Understanding Visual Concepts Across Models