The visual projector serves as an essential bridge between the visual encoder
and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs
adopt a simple MLP to preserve all visual contexts via one-to-one
transformation. However, the visual tokens are redundant and can be
considerably increased when dealing with high-resolution images, impairing the
efficiency of MLLMs significantly. Some recent works have introduced a resampler
or abstractor to reduce the number of resulting visual tokens. Unfortunately,
they fail to capture finer details and undermine the visual reasoning
capabilities of MLLMs. In this work, we propose a novel visual projector that
adopts a coarse-to-fine scheme, injecting enriched characteristics to
generate condensed visual tokens. Specifically, we first interpolate the
visual features as a low-resolution point query, providing the overall visual
representation as the foundation. Then, we introduce a region-to-point
injection module that utilizes high-resolution, multi-level region-based cues
as fine-grained reference keys and values, allowing them to be fully absorbed
within the corresponding local context region. This step effectively updates
the coarse point query, transforming it into an enriched one for the subsequent
LLM reasoning. Extensive experiments demonstrate that our approach compresses
the visual tokens by 75%~89% while achieving comparable or even better
performance across diverse benchmarks with significantly higher efficiency. The
source code can be found at this https URL
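
The coarse-to-fine scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single feature level rather than multi-level cues, approximates the interpolation step with a region mean, omits all learned projections, and the names `pack_tokens` and `compression_ratio` and the downsampling factor `s` are hypothetical. Each coarse point query attends only to the high-resolution features in its own s x s region and is updated with the result.

```python
import math


def pack_tokens(feat, s):
    """Condense an H x W grid of C-dim feature vectors into (H/s)*(W/s) tokens.

    Assumptions (not from the source): one feature level, mean pooling as a
    stand-in for interpolation, and no learned query/key/value projections.
    """
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    if H % s or W % s:
        raise ValueError("grid size must be divisible by s")
    packed = []
    for i in range(0, H, s):
        for j in range(0, W, s):
            # High-resolution features of the local region (keys and values).
            region = [feat[i + di][j + dj] for di in range(s) for dj in range(s)]
            # Coarse point query: low-resolution summary of the region.
            q = [sum(v[c] for v in region) / len(region) for c in range(C)]
            # Region-to-point injection: scaled dot-product attention
            # restricted to the query's own region.
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in region]
            top = max(scores)
            weights = [math.exp(x - top) for x in scores]
            total = sum(weights)
            weights = [w / total for w in weights]
            injected = [sum(w * v[c] for w, v in zip(weights, region)) for c in range(C)]
            # Residual update turns the coarse query into an enriched one.
            packed.append([qc + ic for qc, ic in zip(q, injected)])
    return packed


def compression_ratio(s):
    """Fraction of visual tokens removed when each s x s region becomes one token."""
    return 1.0 - 1.0 / (s * s)
```

Under the assumption that the reported range comes from downsampling factors of 2 and 3 (the abstract does not state the factors), `compression_ratio(2)` gives 75% and `compression_ratio(3)` gives about 89%. For a hypothetical 24 x 24 grid (576 tokens), that leaves 144 and 64 tokens respectively.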

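We propose a novel visual projector that adopts a coarse-to-fine scheme, injecting enriched features to generate condensed visual tokens and achieving higher efficiency.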

TokenPacker: Efficient Visual Projector for Multimodal LLM

Recent advances in large language models (LLMs) have promoted generative
error correction (GER) for automatic speech recognition (ASR), which aims to
predict the ground-truth transcription from the decoded N-best hypotheses.
Thanks to the strong language generation ability of LLMs and rich information
in the N-best list, GER shows great effectiveness in enhancing ASR results.
However, it still suffers from two limitations: 1) LLMs are unaware of the
source speech during GER, which may lead to results that are grammatically
correct but violate the source speech content; 2) N-best hypotheses usually
vary in only a few tokens, making it redundant to send all of them for GER,
which could confuse the LLM about which tokens to focus on and thus lead to
increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for
ASR generative error correction. First, we introduce a multimodal LLM (i.e.,
SpeechGPT) to receive source speech as extra input to improve the fidelity of
correction output. Then, we reformat GER as a cloze test with logits
calibration to remove the input information redundancy and simplify GER with
clear instructions. Experiments show that ClozeGER achieves a new breakthrough
over vanilla GER on 9 popular ASR datasets.
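
The cloze reformulation can be sketched as follows. This is a simplified illustration, not the paper's method: it assumes the N-best hypotheses are already aligned and have the same number of tokens, it leaves out the speech input and the logits calibration, and the function name `build_cloze` and the `<blank_k>` marker format are hypothetical. Positions where all hypotheses agree are kept as fixed context; positions where they differ become blanks with a short list of candidate tokens.

```python
def build_cloze(nbest):
    """Turn equal-length N-best hypotheses into a cloze template plus options.

    Returns (template, options), where options[k - 1] lists the candidate
    tokens for the marker <blank_k>, in order of first appearance.
    """
    tokens = [hyp.split() for hyp in nbest]
    length = len(tokens[0])
    if any(len(t) != length for t in tokens):
        raise ValueError("this sketch assumes equal-length, aligned hypotheses")

    template, options = [], []
    for pos in range(length):
        candidates = []
        for t in tokens:
            if t[pos] not in candidates:
                candidates.append(t[pos])
        if len(candidates) == 1:
            # All hypotheses agree: keep the token as context.
            template.append(candidates[0])
        else:
            # Hypotheses disagree: replace the position with a blank.
            options.append(candidates)
            template.append(f"<blank_{len(options)}>")
    return " ".join(template), options
```

With this layout the model only has to choose among the listed candidates for each blank, rather than reread N nearly identical sentences.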

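This paper proposes ClozeGER, a new paradigm for ASR generative error correction. It introduces a multimodal LLM (i.e., SpeechGPT) to improve the fidelity of the correction output, then reformats GER as a cloze test with logits calibration to remove input information redundancy and simplify the GER process. Experiments show that ClozeGER achieves a new breakthrough on 9 popular ASR datasets.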

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

We show that LLMs hallucinate because their output is not constrained to be
synonymous with claims for which they have evidence: a condition that we call
evidential closure. Information about the truth or falsity of sentences is not
statistically identified in the standard neural probabilistic language model
setup, and so cannot be conditioned on to generate new strings. We then show
how to constrain LLMs to produce output that does satisfy evidential closure. A
multimodal LLM must learn about the external world (perceptual learning); it
must learn a mapping from strings to states of the world (extensional
learning); and, to achieve fluency when generalizing beyond a body of evidence,
it must learn mappings from strings to their synonyms (intensional learning).
The output of a unimodal LLM must be synonymous with strings in a validated
evidence set. Finally, we present a heuristic procedure, Learn-Babble-Prune,
that yields faithful output from an LLM by rejecting output that is not
synonymous with claims for which the LLM has evidence.
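
The pruning step of such a procedure can be sketched as a simple filter. This is only an illustration under strong assumptions: the evidence set and the candidate outputs are given as plain lists, and the synonymy check is a hypothetical stand-in (case- and punctuation-insensitive string equality) for whatever learned notion of synonymy the paper intends. The names `normalize`, `synonymous`, and `prune` are not from the source.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def synonymous(a, b):
    """Hypothetical stand-in for a synonymy check: equality after normalization."""
    return normalize(a) == normalize(b)


def prune(candidates, evidence, is_synonym=synonymous):
    """Keep only candidates that are synonymous with some validated evidence claim."""
    return [c for c in candidates if any(is_synonym(c, e) for e in evidence)]
```

Any candidate that cannot be matched to a claim in the evidence set is rejected, which is the property the abstract calls evidential closure.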

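We show that LLMs hallucinate because their output is not constrained to be synonymous with claims for which they have evidence, a condition called evidential closure. We then show how to constrain LLMs to produce output that satisfies evidential closure, describe the learning requirements for multimodal LLMs, and present a heuristic procedure, Learn-Babble-Prune, that ensures an LLM's output is synonymous with claims for which it has evidence.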

Why LLMs Hallucinate, and How to Get (Evidential) Closure: Perceptual, Intensional, and Extensional Learning for Faithful Natural Language Generation

We present Any-Modality Augmented Language Model (AnyMAL), a unified model
that reasons over diverse input modality signals (i.e., text, image, video,
audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the
powerful text-based reasoning abilities of state-of-the-art LLMs including
LLaMA-2 (70B), and converts modality-specific signals to the joint textual
space through a pre-trained aligner module. To further strengthen the
multimodal LLM's capabilities, we fine-tune the model with a multimodal
instruction set manually collected to cover diverse topics and tasks beyond
simple QAs. We conduct comprehensive empirical analysis comprising both human
and automatic evaluations, and demonstrate state-of-the-art performance on
various multimodal tasks.
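
The idea of mapping modality-specific signals into the text space can be sketched as below. The abstract does not describe the aligner's architecture, so a single linear projection is assumed here purely for illustration; the names `project` and `build_input` and all dimensions are hypothetical. Encoder features are projected to the LLM's embedding width and placed in front of the text token embeddings.

```python
def project(features, weight):
    """Multiply an n x d_in feature matrix by a d_in x d_out weight matrix."""
    d_in, d_out = len(weight), len(weight[0])
    if any(len(f) != d_in for f in features):
        raise ValueError("feature width must match the weight's input size")
    return [
        [sum(f[k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
        for f in features
    ]


def build_input(modality_features, weight, text_embeddings):
    """Prepend projected modality tokens to the text token embeddings.

    Assumption: the aligner is a plain linear map; the real module may differ.
    """
    return project(modality_features, weight) + text_embeddings
```

The resulting sequence has one row per modality token followed by one row per text token, all with the same width, so the language model can consume it as a single input.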

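We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (text, image, video, audio, IMU motion sensor) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs such as LLaMA-2 (70B) and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a manually collected multimodal instruction set that covers diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis, including human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.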