Large Multimodal Models (LMMs) have shown remarkable capabilities across a
variety of tasks (e.g., image captioning, visual question answering). While
broad, their knowledge remains generic (e.g., recognizing a dog), and they are
unable to handle personalized subjects (e.g., recognizing a user's pet dog).
Human reasoning, in contrast, typically operates within the context of specific
subjects in our surroundings. For example, one might ask, "What should I buy
for my dog's birthday?" rather than the generic "What should I buy for a
dog's birthday?" Similarly, when looking at a friend's image, the
interest lies in seeing their activities (e.g., "my friend is holding a cat"),
rather than merely observing generic human actions (e.g., "a man is holding a
cat"). In this paper, we introduce the novel task of personalizing LMMs, so
that they can have conversations about a specific subject. We propose Yo'LLaVA,
which learns to embed a personalized subject into a set of latent tokens given
a handful of example images of the subject. Our qualitative and quantitative
analyses reveal that Yo'LLaVA can learn the concept more efficiently using
fewer tokens and more effectively encode the visual attributes compared to
strong prompting baselines (e.g., LLaVA).
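
As a rough illustration of what "embedding a subject into a set of latent tokens" could look like, the sketch below expands a subject identifier into several learnable vectors inside an otherwise ordinary input embedding sequence. The class `SubjectTokens`, the function `embed_prompt`, the identifier `<my-dog>`, and the toy vocabulary are all hypothetical names chosen for this example and are not taken from the paper; the training loop that would fit the vectors to the example images is omitted.

```python
import random


class SubjectTokens:
    """Hypothetical container: one subject identifier mapped to k latent vectors."""

    def __init__(self, identifier, num_tokens, dim, seed=0):
        rng = random.Random(seed)
        self.identifier = identifier
        # Random initial values; a training loop (not shown) would fit these
        # vectors to the handful of example images of the subject.
        self.vectors = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)
        ]


def embed_prompt(words, vocab_embeddings, subject):
    """Map words to vectors, expanding the subject identifier into its latent tokens."""
    sequence = []
    for word in words:
        if word == subject.identifier:
            sequence.extend(subject.vectors)  # one identifier -> k vectors
        else:
            sequence.append(vocab_embeddings[word])
    return sequence


# Toy usage with a made-up four-word vocabulary.
dim = 4
vocab = {w: [float(i)] * dim for i, w in enumerate(["what", "is", "doing", "?"])}
dog = SubjectTokens("<my-dog>", num_tokens=3, dim=dim)
seq = embed_prompt(["what", "is", "<my-dog>", "doing", "?"], vocab, dog)
print(len(seq))  # 4 ordinary word vectors + 3 latent vectors = 7
```

The point of the sketch is only that the subject occupies a fixed, small number of positions in the input, regardless of how many example images were used to learn it.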

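This paper introduces Yo'LLaVA, a method that embeds a personalized subject into a set of latent tokens, learning efficiently from a handful of example images and encoding visual attributes more effectively, so that Large Multimodal Models (LMMs) can hold conversations about a specific subject.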

Yo'LLaVA: Your Personalized Language and Vision Assistant

Vision transformers (ViTs) have achieved impressive results on various
computer vision tasks in the last several years. In this work, we study the
capability of frozen ViTs, pretrained only on visual data, to generalize to
audio-visual data without finetuning any of their original parameters. To do so,
we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained
ViTs to audio-visual tasks by injecting a small number of trainable parameters
into every layer of a frozen ViT. To efficiently fuse visual and audio cues,
our LAVISH adapter uses a small set of latent tokens, which form an attention
bottleneck, thus eliminating the quadratic cost of standard cross-attention.
Compared to the existing modality-specific audio-visual methods, our approach
achieves competitive or even better performance on various audio-visual tasks
while using fewer tunable parameters and without relying on costly audio
pretraining or external audio encoders. Our code is available at
this https URL
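
The attention bottleneck described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the LAVISH implementation: it uses single-head attention with no learned projections, fuses in one direction only (audio into visual), and the function names `attend`, `bottleneck_fuse`, and `score_counts` are invented for this example. The latent tokens first summarize the audio tokens, and the visual tokens then attend only to that small summary.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each query row becomes a weighted average of the rows in keys_values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return out


def bottleneck_fuse(visual, audio, latents):
    """Fuse audio into visual tokens through a small set of latent tokens."""
    # Step 1: the latents attend to all audio tokens (n_latent x n_audio scores).
    summary = attend(latents, audio)
    # Step 2: the visual tokens attend only to the summary (n_visual x n_latent scores).
    update = attend(visual, summary)
    # Residual connection: add the audio-derived update to each visual token.
    return [[v + u for v, u in zip(vr, ur)] for vr, ur in zip(visual, update)]


def score_counts(n_visual, n_audio, n_latent):
    """Query-key dot products needed for direct vs. bottlenecked fusion."""
    direct = n_visual * n_audio
    bottleneck = n_latent * n_audio + n_visual * n_latent
    return direct, bottleneck


# Toy usage with random tokens; the sizes are arbitrary illustration values.
rng = random.Random(0)


def random_tokens(n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


fused = bottleneck_fuse(random_tokens(196, 16), random_tokens(128, 16), random_tokens(2, 16))
print(len(fused), len(fused[0]))  # 196 16
print(score_counts(196, 128, 2))  # (25088, 648)
```

With a fixed number of latent tokens, the number of attention scores grows with the sum of the two sequence lengths rather than their product, which is the cost saving the abstract refers to.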

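This paper studies the capability of frozen vision transformers and the feasibility of applying them to audio-visual tasks with the LAVISH adapter; the results show that the approach performs well.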