This paper, for the first time, explores text-to-image diffusion models for
Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal
discovery: the capacity of text-to-image diffusion models to seamlessly bridge
the gap between sketches and photos. This proficiency is underpinned by their
robust cross-modal capabilities and shape bias, findings that are substantiated
through our pilot studies. In order to harness pre-trained diffusion models
effectively, we introduce a straightforward yet powerful strategy focused on
two key aspects: selecting optimal feature layers and utilising visual and
textual prompts. For the former, we identify which layers carry the richest
information and best suit the specific retrieval requirement (category-level
or fine-grained). For the latter, we employ visual and textual prompts to
guide the model's feature extraction process, enabling it to generate more
discriminative and contextually relevant cross-modal representations. Extensive
experiments on several benchmark datasets show significant performance
improvements.
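
The layer-selection idea in the abstract can be illustrated with a minimal
sketch. It assumes that features for sketches and photos have already been
extracted from several candidate layers of a frozen diffusion model, and it
simply keeps the layer whose features give the best top-1 retrieval accuracy
on held-out data. The abstract does not state the selection criterion, so the
top-1 criterion, the layer names (`layer_a`, `layer_b`), the helper functions,
and the toy feature vectors are all hypothetical and stand in for whatever the
paper actually uses.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_gallery(query: Vector, gallery: List[Vector]) -> List[int]:
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def top1_accuracy(
    sketches: List[Vector],
    sketch_labels: List[str],
    photos: List[Vector],
    photo_labels: List[str],
) -> float:
    """Fraction of sketches whose best-matching photo carries the same label.

    Labels can be category names (category-level retrieval) or instance IDs
    (fine-grained retrieval); the code does not distinguish between the two.
    """
    hits = 0
    for feat, label in zip(sketches, sketch_labels):
        best = rank_gallery(feat, photos)[0]
        hits += int(photo_labels[best] == label)
    return hits / len(sketches)


def select_layer(
    features_by_layer: Dict[str, Tuple[List[Vector], List[Vector]]],
    sketch_labels: List[str],
    photo_labels: List[str],
) -> str:
    """Pick the layer name whose (sketch, photo) features retrieve best."""
    return max(
        features_by_layer,
        key=lambda name: top1_accuracy(
            features_by_layer[name][0], sketch_labels,
            features_by_layer[name][1], photo_labels,
        ),
    )


# Toy data (hypothetical): two candidate layers, two sketches, two photos.
labels = ["cat", "dog"]
toy = {
    "layer_a": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]]),
    "layer_b": ([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]]),
}
best_layer = select_layer(toy, labels, labels)
```

Running the same selection once with category labels and once with instance
IDs would be one way to obtain separate layer choices for category-level and
fine-grained retrieval.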

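This paper is the first to explore text-to-image diffusion models for
zero-shot sketch-based image retrieval. It finds that these models can
seamlessly bridge the gap between sketches and photos thanks to their
cross-modal capabilities and shape bias, as verified by our pilot studies. To
make effective use of pre-trained diffusion models, we introduce a simple yet
effective strategy focused on two key aspects: selecting the best feature
layers and using visual and textual prompts. We identify the layers that are
richest in information and best suited to the specific retrieval requirement
(category-level or fine-grained), then use visual and textual prompts to guide
the model's feature extraction so that it produces more discriminative and
contextually relevant cross-modal representations. Extensive experiments on
several benchmark datasets confirm significant performance improvements.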

Text-to-Image Diffusion Models Are Excellent Sketch-Photo Matching Tools

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed- and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.

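By extending the instruction-tuned Llama-2 model while preserving the wide
range of LLM capabilities, this work proposes a model with end-to-end
general-purpose speech processing and reasoning abilities. The model can use
audio prompts in place of text to hold a conversation, and it has cross-modal
capabilities such as speech question answering, speech translation, and audio
summarization. Experiments show that this end-to-end approach is on par with
or outperforms a cascaded system (speech recognizer + LLM) in modeling the
response, and that it can make better use of the prior context in a
conversation to provide better results.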