Recent multimodal large language models (MLLMs) such as GPT-4o and GPT-4v have
shown great potential in autonomous driving. In this paper, we propose a
cross-domain few-shot in-context learning method based on the MLLM for
enhancing traffic sign recognition (TSR). We first construct a traffic sign
detection network based on Vision Transformer Adapter and an extraction module
to extract traffic signs from the original road images. To reduce the
dependence on training data and improve the performance stability of
cross-country TSR, we introduce a cross-domain few-shot in-context learning
method based on the MLLM. To enhance the MLLM's fine-grained recognition of
traffic signs, the proposed method generates corresponding description texts
using template traffic signs. These description texts contain key information
about the shape, color, and composition of traffic signs, which can stimulate
the ability of the MLLM to perceive fine-grained traffic sign categories. By using
the description texts, our method reduces the cross-domain differences between
template and real traffic signs. Our approach requires only simple and uniform
textual indications, without the need for large-scale traffic sign images and
labels. We perform comprehensive evaluations on the German traffic sign
recognition benchmark dataset, the Belgium traffic sign dataset, and two
real-world datasets collected in Japan. The experimental results show that our
method significantly enhances TSR performance.
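
The description-text idea can be illustrated with a minimal sketch. It assumes the few-shot prompt is plain text listing each candidate category with its shape, color, and composition. The `TemplateSign` schema, the prompt wording, and the two example signs are hypothetical and are filled in by hand here, whereas the method described above generates the description texts from template traffic signs.

```python
from dataclasses import dataclass


@dataclass
class TemplateSign:
    """Attributes of one template traffic sign (hypothetical schema)."""
    label: str
    shape: str
    color: str
    composition: str


def describe(sign: TemplateSign) -> str:
    """Render a uniform description text covering shape, color, and composition."""
    return (f"{sign.label}. Shape: {sign.shape}; Color: {sign.color}; "
            f"Composition: {sign.composition}.")


def build_prompt(templates: list[TemplateSign]) -> str:
    """Assemble the textual part of a few-shot in-context prompt."""
    lines = [
        "You are given a cropped traffic sign image.",
        "Candidate categories and their descriptions:",
    ]
    lines += [f"{i}. {describe(t)}" for i, t in enumerate(templates, start=1)]
    lines.append("Answer with the single best-matching category name.")
    return "\n".join(lines)


# Placeholder templates, written by hand for illustration only.
templates = [
    TemplateSign("Stop", "octagon", "red with a white border", "the white word STOP"),
    TemplateSign("Yield", "inverted triangle", "white with a red border", "no inner symbol"),
]
prompt = build_prompt(templates)
print(prompt)
```

The prompt text would be sent to the MLLM together with the sign image cropped by the detection stage. That call is omitted because it depends on the model API in use.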

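This work proposes a cross-domain few-shot in-context learning method based on a multimodal large language model (MLLM) to enhance traffic sign recognition (TSR), and improves the MLLM's fine-grained classification of traffic signs by generating corresponding description texts. Experimental results show that the method significantly improves TSR performance.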

Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

The visual projector serves as an essential bridge between the visual encoder
and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs
adopt a simple MLP to preserve all visual contexts via one-to-one
transformation. However, the visual tokens are redundant and can be
considerably increased when dealing with high-resolution images, impairing the
efficiency of MLLMs significantly. Some recent works have introduced a resampler
or abstractor to reduce the number of resulting visual tokens. Unfortunately,
they fail to capture finer details and undermine the visual reasoning
capabilities of MLLMs. In this work, we propose a novel visual projector, which
adopts a coarse-to-fine scheme to inject enriched characteristics and
generate condensed visual tokens. Specifically, we first interpolate the
visual features as a low-resolution point query, providing the overall visual
representation as the foundation. Then, we introduce a region-to-point
injection module that utilizes high-resolution, multi-level region-based cues
as fine-grained reference keys and values, allowing them to be fully absorbed
within the corresponding local context region. This step effectively updates
the coarse point query, transforming it into an enriched one for the subsequent
LLM reasoning. Extensive experiments demonstrate that our approach compresses
the visual tokens by 75% to 89% while achieving comparable or even better
performance across diverse benchmarks with significantly higher efficiency. The
source code can be found at this https URL
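
The coarse-to-fine scheme can be sketched in a few lines of plain Python. This is an illustration under strong assumptions, not the authors' implementation: the coarse point query is a region mean (a stand-in for interpolation), keys and values are the same single-level features rather than multi-level cues, and there are no learned projections. The function names and the 24x24 grid are hypothetical. If the query grid is downsampled by a factor `scale` per side, the fraction of tokens removed is 1 - 1/scale^2, so factors of 2 and 3 give 75% and about 89%, consistent with the range reported above.

```python
import math


def compression_ratio(grid: int, scale: int) -> float:
    """Fraction of tokens removed when a grid x grid map is downsampled by `scale` per side."""
    kept = (grid // scale) ** 2
    return 1.0 - kept / (grid * grid)


def region_to_point(features, scale):
    """Condense an H x W grid of feature vectors into (H/scale) x (W/scale) point queries.

    H and W must be divisible by `scale`. Each query attends only to its own
    scale x scale region (local attention), then is updated with a residual add.
    """
    h, w = len(features), len(features[0])
    dim = len(features[0][0])
    out = []
    for i in range(0, h, scale):
        row = []
        for j in range(0, w, scale):
            region = [features[y][x]
                      for y in range(i, i + scale)
                      for x in range(j, j + scale)]
            # Coarse point query: mean of the region (stand-in for interpolation).
            query = [sum(v[d] for v in region) / len(region) for d in range(dim)]
            # Scaled dot-product scores between the query and each region key.
            scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                      for key in region]
            peak = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            attended = [sum(wt * v[d] for wt, v in zip(weights, region)) / total
                        for d in range(dim)]
            # Residual update: coarse query plus injected fine-grained detail.
            row.append([q + a for q, a in zip(query, attended)])
        out.append(row)
    return out


# Toy 4x4 grid of 2-dimensional features, packed into a 2x2 grid of queries.
grid = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
packed = region_to_point(grid, scale=2)
print(len(packed) * len(packed[0]))        # 4 tokens instead of 16
print(compression_ratio(24, 2))            # 0.75
print(round(compression_ratio(24, 3), 3))  # 0.889
```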

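We propose a novel visual projector that adopts a coarse-to-fine scheme, injecting enriched features to generate condensed visual tokens and achieving higher efficiency.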

TokenPacker: Efficient Visual Projector for Multimodal LLM

In the rapidly advancing field of conditional image generation research,
limited explainability and related challenges make it difficult to effectively
evaluate the performance and capabilities of various models. This paper introduces VIESCORE,
a Visual Instruction-guided Explainable metric for evaluating any conditional
image generation tasks. VIESCORE leverages general knowledge from Multimodal
Large Language Models (MLLMs) as the backbone and does not require training or
fine-tuning. We evaluate VIESCORE on seven prominent conditional image generation
tasks and find: (1) VIESCORE (GPT-4v) achieves a high Spearman correlation of
0.3 with human evaluations, while the human-to-human correlation is 0.45. (2)
VIESCORE (with open-source MLLM) is significantly weaker than GPT-4v in
evaluating synthetic images. (3) VIESCORE achieves a correlation on par with
human ratings in the generation tasks but struggles in editing tasks. With
these results, we believe VIESCORE shows great potential to replace human
judges in evaluating image synthesis tasks.
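
For readers unfamiliar with the reported statistic, the following is a minimal sketch of how a Spearman correlation between metric scores and human ratings can be computed. The score lists are made-up illustrative values, not data from the paper. The function names are hypothetical. Ties receive average ranks, and inputs with zero variance are not handled.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores for five images (illustration only).
metric_scores = [7.0, 3.0, 9.0, 5.0, 6.0]
human_scores = [6.0, 4.0, 8.0, 7.0, 5.0]
rho = spearman(metric_scores, human_scores)
print(rho)  # approximately 0.7 for these toy values
```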

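This paper introduces VIESCORE, a visual instruction-guided explainable metric for evaluating any conditional image generation task. VIESCORE leverages the general knowledge of multimodal large language models (MLLMs) as its backbone and requires no training or fine-tuning. Evaluating VIESCORE on seven prominent conditional image tasks, we find: (1) VIESCORE (GPT-4v) achieves a Spearman correlation of 0.3 with human evaluations, while the human-to-human correlation is 0.45. (2) VIESCORE with an open-source MLLM is significantly weaker than with GPT-4v in evaluating synthetic images. (3) VIESCORE achieves a correlation on par with human ratings in generation tasks but struggles in editing tasks. Based on these results, we believe VIESCORE shows great potential to replace human judges in evaluating image synthesis tasks.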

VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation

Text-rich VQA, namely Visual Question Answering based on text recognition in
the images, is a cross-modal task that requires both image comprehension and
text recognition. In this work, we focus on investigating the advantages and
bottlenecks of LLM-based approaches in addressing this problem. To this end,
we separate the vision and language modules: we leverage
external OCR models to recognize texts in the image and Large Language Models
(LLMs) to answer the question given the recognized texts. The whole framework is
training-free, benefiting from the in-context ability of LLMs. This pipeline achieves superior
performance compared to the majority of existing Multimodal Large Language
Models (MLLMs) on four text-rich VQA datasets. In addition, based on the ablation
study, we find that the LLM brings stronger comprehension ability and may introduce
helpful knowledge for the VQA problem. The bottleneck for LLMs in addressing
text-rich VQA problems may primarily lie in the visual part. We also combine the
OCR module with MLLMs and find that this combination works as well.
It is worth noting that not all MLLMs can comprehend the
OCR information, which provides insights into how to train an MLLM that
preserves the abilities of LLM.
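
The separated OCR-plus-LLM pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: `ocr` and `llm` are hypothetical callables supplied by the caller (no specific OCR engine or LLM API is implied), and the prompt wording and in-context example format are invented for this sketch. The stubs at the bottom exist only so the example runs without external services.

```python
def build_vqa_prompt(ocr_tokens, question, examples=()):
    """Format OCR text, the question, and optional in-context examples as one prompt.

    `examples` is a sequence of (ocr_tokens, question, answer) triples.
    """
    parts = ["Answer the question using the text recognized in the image."]
    for ex_tokens, ex_question, ex_answer in examples:
        ex_text = " | ".join(ex_tokens)
        parts.append(f"OCR text: {ex_text}\nQuestion: {ex_question}\nAnswer: {ex_answer}")
    text = " | ".join(ocr_tokens)
    parts.append(f"OCR text: {text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


def answer_question(image, question, ocr, llm, examples=()):
    """Training-free pipeline: the OCR module reads the image, the LLM answers from text."""
    tokens = ocr(image)  # vision module: image -> list of recognized strings
    prompt = build_vqa_prompt(tokens, question, examples)
    return llm(prompt).strip()  # language module: prompt -> answer text


# Stub components so the sketch runs without external services (illustration only).
def fake_ocr(image):
    return ["OPEN", "9AM-5PM"]


def fake_llm(prompt):
    return " 9AM" if "9AM-5PM" in prompt else " unknown"


shots = [(["EXIT", "12"], "What is the exit number?", "12")]
result = answer_question("storefront.jpg", "When does the shop open?",
                         fake_ocr, fake_llm, shots)
print(result)  # 9AM
```

Keeping the two modules behind plain callables makes it straightforward to swap the OCR engine or the language model independently, which is the kind of separation an ablation over either module relies on.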

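Text-rich visual question answering, based on text recognition in images, is a cross-modal task that requires both image comprehension and text recognition. This paper investigates the advantages and bottlenecks of LLM-based approaches to this problem and, by combining an OCR module with MLLMs, finds that not all MLLMs can comprehend OCR information, offering insights into training MLLMs that preserve the abilities of LLMs.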