In recent years, Large Language Models (LLMs) have garnered significant
attention from the research community due to their exceptional performance and
generalization capabilities. In this paper, we introduce a novel method for
contextualizing speech recognition models by incorporating LLMs. Our approach
casts speech recognition as a mixed-modal language modeling task based on a
pretrained LLM. We provide audio features, along with optional text tokens for
context, to train the system to complete transcriptions in a decoder-only
fashion. As a result, the system is implicitly incentivized to learn how to
leverage unstructured contextual information during training. Our empirical
results demonstrate a significant improvement in performance, with a 6% WER
reduction when additional textual context is provided. Moreover, we find that
our method performs competitively, improving by 7.5% WER overall and 17% WER
on rare words against a baseline contextualized RNN-T system that has been
trained on a speech dataset more than twenty-five times larger. Overall, we
demonstrate that by adding only a handful of trainable parameters via
adapters, we can unlock contextualized speech recognition capability for the
pretrained LLM while keeping the same text-only input functionality.
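
The decoder-only, mixed-modal setup described above can be pictured as one sequence in which optional context tokens and audio frames act as a prompt and the transcription is the part to be completed. The sketch below is illustrative only: the ordering (context, then audio, then transcription), the loss mask restricted to transcription positions, and the names `MixedModalExample` and `build_example` are assumptions, not details stated in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class MixedModalExample:
    kinds: List[str]       # "context", "audio", or "target" for each position
    loss_mask: List[bool]  # True where a next-token loss would be applied


def build_example(
    num_audio_frames: int,
    target_tokens: Sequence[int],
    context_tokens: Optional[Sequence[int]] = None,
) -> MixedModalExample:
    """Lay out one training sequence (assumed order: context, audio, target)."""
    context = list(context_tokens or [])  # the text context is optional
    kinds = (
        ["context"] * len(context)
        + ["audio"] * num_audio_frames
        + ["target"] * len(target_tokens)
    )
    # Assumption: only the transcription positions contribute to the loss.
    loss_mask = [kind == "target" for kind in kinds]
    return MixedModalExample(kinds=kinds, loss_mask=loss_mask)
```

Calling `build_example(3, [7, 8], context_tokens=[1])` lays out one context position, three audio positions, and two transcription positions; leaving out `context_tokens` gives the plain audio-to-text case.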

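By introducing a new method that incorporates large language models (LLMs) into contextualized speech recognition, we show that adding a small number of trainable parameters via adapters unlocks contextualized speech recognition for a pretrained LLM and significantly improves performance, while preserving the same text input functionality.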

Contextualizing End-to-End Speech Recognition with Large Language Models

End-to-End Speech Recognition Contextualization with Large Language Models

Pre-trained vision-language models like CLIP have recently shown superior
performance on various downstream tasks, including image classification and
segmentation. However, in fine-grained image re-identification (ReID), the
labels are indexes, lacking concrete text descriptions. Therefore, it remains
to be determined how such models could be applied to these tasks. This paper
first finds that simply fine-tuning the visual model initialized from the
image encoder in CLIP already achieves competitive performance in various
ReID tasks. Then we propose a two-stage strategy to facilitate a better visual
representation. The key idea is to fully exploit the cross-modal description
ability in CLIP through a set of learnable text tokens for each ID and give
them to the text encoder to form ambiguous descriptions. In the first training
stage, the image and text encoders from CLIP are kept fixed, and only the text
tokens are optimized from scratch by the contrastive loss computed within a batch. In
the second stage, the ID-specific text tokens and their encoder become static,
providing constraints for fine-tuning the image encoder. With the help of the
designed loss in the downstream task, the image encoder is able to accurately
represent data as vectors in the feature embedding. The effectiveness of the
proposed strategy is validated on several datasets for the person or vehicle
ReID tasks. Code is available at this https URL
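
The two-stage schedule above can be sketched in a few lines. This is a minimal illustration, not the released code: the group names in `trainable_groups` are hypothetical labels, and the loss is a generic image-to-text InfoNCE that assumes every image in the batch has a distinct ID (so `text_feats[i]` is the only positive for `image_feats[i]`) and that no feature vector is all zeros. The temperature default is also an assumption.

```python
import math
from typing import List, Sequence, Set


def trainable_groups(stage: int) -> Set[str]:
    """Which (hypothetically named) parameter groups are updated in each stage."""
    if stage == 1:
        return {"id_text_tokens"}  # both CLIP encoders stay frozen
    if stage == 2:
        return {"image_encoder"}   # text tokens and text encoder stay frozen
    raise ValueError("stage must be 1 or 2")


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(_dot(v, v))  # assumes a non-zero vector
    return [x / norm for x in v]


def image_to_text_contrastive_loss(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Mean InfoNCE loss over a batch, using cosine similarity as the score."""
    images = [_normalize(v) for v in image_feats]
    texts = [_normalize(v) for v in text_feats]
    total = 0.0
    for i, img in enumerate(images):
        logits = [_dot(img, t) / temperature for t in texts]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(images)
```

With this loss, a batch whose image and text features are aligned scores lower than the same batch with the text features swapped, which is the signal that would drive the text tokens in the first stage.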

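This paper proposes a method that exploits the text-image interaction ability of CLIP to address fine-grained image re-identification: learned text tokens are given to the text encoder to form ambiguous text descriptions that strengthen the visual representation, and the text tokens are optimized through training based on a contrastive loss.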

CLIP-ReID: Fully Exploiting Vision-Language Models for Image Re-Identification without Concrete Text Labels

CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels

Multi-channel video-language retrieval requires models to understand
information from different channels (e.g. video+question, video+speech) to
correctly link a video with a textual response or query. Fortunately,
contrastive multimodal models have been shown to be highly effective at aligning
entities in images/videos and text, e.g., CLIP; text contrastive models have been
extensively studied recently for their strong ability to produce
discriminative sentence embeddings, e.g., SimCSE. However, there is no clear
way to quickly adapt these two lines of work to multi-channel video-language
retrieval with limited data and resources. In this paper, we identify a principled model
design space with two axes: how to represent videos and how to fuse video and
text information. Based on a categorization of recent methods, we investigate the
options of representing videos using continuous feature vectors or discrete
text tokens; for the fusion method, we explore the use of a multimodal
transformer or a pretrained contrastive text model. We extensively evaluate the
four combinations on five video-language datasets. We surprisingly find that
discrete text tokens coupled with a pretrained contrastive text model yield
the best performance, which can even outperform the state of the art on the iVQA
and How2QA datasets without additional training on millions of video-text pairs.
Further analysis shows that this is because representing videos as text tokens
captures the key visual information and text tokens are naturally aligned with
text models that are strong retrievers after the contrastive pretraining
process. Together, these empirical analyses establish a solid foundation for future
research on affordable and upgradable multimodal intelligence.
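
The two-axis design space above can be enumerated directly, and the best-performing cell (text tokens fused by a text model) reduces retrieval to text-to-text matching. The sketch below is a toy illustration under stated assumptions: the axis labels are hypothetical names, and `toy_text_embedding` is a bag-of-words stand-in for a pretrained contrastive text encoder such as SimCSE, used only so the example runs without external models.

```python
import math
from collections import Counter
from itertools import product
from typing import List, Sequence

# The two axes named in the abstract; the labels themselves are assumptions.
VIDEO_REPRESENTATIONS = ("continuous_features", "discrete_text_tokens")
FUSION_METHODS = ("multimodal_transformer", "contrastive_text_model")
DESIGN_SPACE = list(product(VIDEO_REPRESENTATIONS, FUSION_METHODS))


def toy_text_embedding(text: str) -> Counter:
    """Stand-in for a contrastive text encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(video_tokens: Sequence[str], question: str, candidates: List[str]) -> str:
    """Fuse video text tokens with the question as plain text, then rank candidates."""
    query = toy_text_embedding(" ".join(video_tokens) + " " + question)
    return max(candidates, key=lambda c: cosine(query, toy_text_embedding(c)))
```

Because the video is already expressed as text, fusion here is plain string concatenation; swapping the toy embedding for a real sentence encoder would leave `retrieve` unchanged.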

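We explore the best way to fuse information in multimodal retrieval using pretrained contrastive models and text tokens, and find that representing videos as discrete text tokens yields the best results.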