Large language models are few-shot learners that can solve diverse tasks from a handful of demonstrations. This implicit understanding of tasks suggests that the attention mechanisms over word tokens may play a role in analogical reasoning. In this work, we investigate whether analogical reasoning can enable in-context composition over composable elements of visual stimuli. First, we introduce a suite of three benchmarks to test the generalization properties of a visual in-context learner. We formalize the notion of an analogy-based in-context learner and use it to design a meta-learning framework called Im-Promptu. Whereas the requisite token granularity for language is well established, the appropriate compositional granularity for enabling in-context generalization in visual stimuli is usually unspecified. To this end, we use Im-Promptu to train multiple agents with different levels of compositionality, including vector representations, patch representations, and object slots. Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. At the same time, object-centric tokenizers coupled with a cross-attention module generate consistent and high-fidelity solutions, with these inductive biases being particularly crucial for compositional generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation.

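This work investigates whether analogical reasoning can enable in-context composition over the composable elements of visual stimuli, and proposes a meta-learning framework called Im-Promptu, which is used to train multiple agents with different levels of compositionality. Experiments reveal a tradeoff between extrapolation ability and the degree of compositionality: non-compositional representations extend learned composition rules to unseen domains but perform poorly on combinatorial tasks. Object-centric tokenizers coupled with a cross-attention module generate consistent, high-fidelity solutions, and these inductive biases are particularly crucial for compositional generalization. Finally, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation.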

Im-Promptu: In-Context Composition from Image Prompts