Large Multimodal Models (LMMs) are a hot research topic in computer vision
and have also demonstrated remarkable potential across multiple
disciplines. A recent trend is to further extend and enhance the
perception capabilities of LMMs. The current methods follow the paradigm of
adapting the visual task outputs to the format of the language model, which is
the main component of an LMM. This adaptation makes such LMMs convenient to
develop with minimal modifications; however, it overlooks the intrinsic
characteristics of diverse visual tasks and hinders the learning of perception
capabilities. To address this issue, we propose a novel LMM architecture named
Lumen, a Large multimodal model with versatile vision-centric capability
enhancement. We decouple the LMM's learning of perception capabilities into
task-agnostic and task-specific stages. Lumen first promotes fine-grained
vision-language concept alignment, which is the fundamental capability for
various visual tasks. Thus the output of the task-agnostic stage is a shared
representation for all the tasks we address in this paper. Then the
task-specific decoding is carried out by flexibly routing the shared
representation to lightweight task decoders with negligible training efforts.
Benefiting from such a decoupled design, Lumen surpasses existing LMM-based
approaches on the COCO detection benchmark by a clear margin and exhibits
seamless scalability to additional visual tasks. We also conduct
comprehensive ablation studies and generalization evaluations for deeper
insights. The code will be released at this https URL
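
The decoupled design described above (one shared, task-agnostic representation routed to lightweight task decoders) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the shared representation is modeled as a small 2D score map, and the names `decode_point`, `decode_box`, `DECODERS`, and `route` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared representation: one matching score per grid cell.
Heatmap = List[List[float]]


def decode_point(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (row, col) of the highest-scoring cell."""
    return max(
        ((r, c) for r, row in enumerate(heatmap) for c, _ in enumerate(row)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )


def decode_box(heatmap: Heatmap, threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) enclosing all cells with score >= threshold."""
    cells = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not cells:
        raise ValueError("no cell reaches the threshold")
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)


# Task-specific stage: each task maps to a cheap decoder over the same input.
DECODERS: Dict[str, Callable[[Heatmap], tuple]] = {
    "point": decode_point,
    "box": decode_box,
}


def route(task: str, heatmap: Heatmap) -> tuple:
    """Send the shared representation to the decoder registered for `task`."""
    return DECODERS[task](heatmap)


shared = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.6],
    [0.1, 0.7, 0.3],
]
print(route("point", shared))
print(route("box", shared))
```

In this sketch, supporting a new task only requires registering another decoder in `DECODERS`; the shared representation is left unchanged.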

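Large Multimodal Models (LMMs) are a hot research topic in computer vision, and a recent trend is to further extend and enhance their perception capabilities. We propose Lumen, a novel LMM architecture that decouples the learning of perception capabilities into task-agnostic and task-specific stages. It significantly surpasses existing LMM-based approaches on the COCO detection benchmark and shows seamless scalability to other visual tasks.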

Lumen: Unlocking the Diverse Visual Capabilities of Large Multimodal Models

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

With the evolution of storage and communication protocols, ultra-low bitrate
image compression has become a topic in high demand. However, existing
compression algorithms must sacrifice either consistency with the ground truth
or perceptual quality at ultra-low bitrate. In recent years, the rapid
development of the Large Multimodal Model (LMM) has made it possible to balance
these two goals. To solve this problem, this paper proposes a method called
Multimodal Image Semantic Compression (MISC), which consists of an LMM encoder
that extracts the semantic information of the image, a map encoder that locates
the regions corresponding to the semantics, an image encoder that generates an
extremely compressed bitstream, and a decoder that reconstructs the image based
on the above information. Experimental results show that our proposed MISC is
suitable for compressing both traditional Natural Sense Images (NSIs) and
emerging AI-Generated Images (AIGIs) content. It can achieve optimal
consistency and perception results while saving 50% of the bitrate, giving it
strong potential for applications in the next generation of storage and communication. The
code will be released on this https URL
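
For readers unfamiliar with how a bitrate saving is usually quantified, the following is a minimal sketch of the standard bits-per-pixel (bpp) arithmetic. The helper names and the example file sizes are hypothetical and are not measurements from the paper.

```python
def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Bitrate of a compressed image: total bits divided by pixel count."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return num_bytes * 8 / (height * width)


def bitrate_saving(baseline_bpp: float, proposed_bpp: float) -> float:
    """Fraction of the baseline bitrate saved by the proposed codec."""
    if baseline_bpp <= 0:
        raise ValueError("baseline bitrate must be positive")
    return 1.0 - proposed_bpp / baseline_bpp


# Hypothetical example: a 512x512 image stored in 2048 bytes vs. 1024 bytes.
baseline = bits_per_pixel(2048, 512, 512)
proposed = bits_per_pixel(1024, 512, 512)
print(baseline, proposed, bitrate_saving(baseline, proposed))
```

A saving of 0.5 from `bitrate_saving` corresponds to the "50% bitrate" figure in the sense that the proposed bitstream is half the size of the baseline at the same resolution.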

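This study proposes a method called Multimodal Image Semantic Compression (MISC), which uses a Large Multimodal Model (LMM) to balance the compression of traditional natural sense images and AI-generated images. It achieves optimal consistency and perception results while saving 50% of the bitrate, and has strong potential for applications in storage and communication.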

MISC: Ultra-Low Bitrate Image Semantic Compression Driven by Large Multimodal Models

MISC: Ultra-low Bitrate Image Semantic Compression Driven by Large Multimodal Model

The Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding
capabilities, making it possible to handle certain tasks through the Visual
Question Answering (VQA) paradigm. This paper explores the potential of
VQA-oriented GPT-4V for the recently popular visual Anomaly Detection (AD) task and
is the first to conduct qualitative and quantitative evaluations on the popular
MVTec AD and VisA datasets. Considering that this task requires both
image- and pixel-level evaluations, the proposed GPT-4V-AD framework contains three
components: 1) Granular Region Division, 2) Prompt Designing, and 3)
Text2Segmentation for easy quantitative evaluation. We also try several
variants for comparative analysis. The results show that GPT-4V can
achieve some success in the zero-shot AD task through a VQA paradigm,
reaching image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6
on the MVTec AD and VisA datasets, respectively. However, its performance still
lags behind state-of-the-art zero-shot methods, e.g., WinCLIP
and CLIP-AD, and further research is needed. This study provides a baseline
reference for research on VQA-oriented LMMs in the zero-shot AD task, and we
also suggest several possible directions for future work. Code is available at
https://github.com/zhangzjn/GPT-4V-AD.
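
The AU-ROC metric reported above can be computed directly from anomaly scores and ground-truth labels. The following is a minimal sketch of the pairwise definition using only the standard library. It is not the evaluation code from the linked repository, and the function name `auroc` and the toy data are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomalous sample (label 1) receives a higher
    score than a random normal sample (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy image-level example: two normal and two anomalous images.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of the 4 anomalous/normal pairs are ranked correctly
```

The same function covers the pixel-level case if every pixel is treated as one sample, although the quadratic pairwise loop becomes slow for full-resolution masks and a rank-based implementation is then preferable.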

GPT-4V-AD, a VQA-oriented framework utilizing the Large Multimodal Model (LMM) GPT-4V, shows promise in the zero-shot Anomaly Detection (AD) task, achieving certain results but with room for improvement compared to state-of-the-art methods.