In recent years, multimodal large language models (MLLMs) have shown
remarkable capabilities in tasks such as visual question answering and
commonsense reasoning, while visual perception models have made significant
strides in perception tasks such as detection and segmentation. However, MLLMs
mainly focus on high-level image-text interpretation and struggle with
fine-grained visual understanding, and visual perception models usually suffer
from open-world distribution shifts due to their limited model capacity. To overcome
these challenges, we propose the Mutually Reinforced Multimodal Large Language
Model (MR-MLLM), a novel framework that synergistically enhances visual
perception and multimodal comprehension. First, a shared query fusion mechanism
is proposed to harmonize detailed visual inputs from vision models with the
linguistic depth of language models, enhancing multimodal comprehension and
vision perception synergistically. Second, we propose a perception-enhanced
cross-modal integration method that incorporates novel modalities from vision
perception outputs, such as object detection bounding boxes, to capture subtle
visual elements, thus enriching the understanding of both visual and textual
data. In addition, an innovative perception-embedded prompt generation
mechanism is proposed to embed perceptual information into the language model's
prompts, aligning the responses contextually and perceptually for a more
accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's
superior performance in various multimodal comprehension and vision perception
tasks, particularly those requiring corner-case vision perception and
fine-grained language comprehension.
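
The perception-embedded prompt idea can be illustrated with a minimal text-level sketch. The abstract does not specify how perceptual information enters the prompt, so everything below is an assumption made for illustration: the `Detection` record, the `build_perception_prompt` helper, the `min_score` threshold, and the prompt wording are all hypothetical and are not the authors' implementation, which may operate on embeddings rather than text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical container for one object-detector output."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                    # detector confidence in [0, 1]


def build_perception_prompt(question: str,
                            detections: List[Detection],
                            min_score: float = 0.5) -> str:
    """Prepend confident detections to a question as plain-text context.

    This is an illustrative stand-in for perception-embedded prompting,
    not the mechanism described in the paper.
    """
    # Keep only confident detections, most confident first.
    kept = sorted((d for d in detections if d.score >= min_score),
                  key=lambda d: d.score, reverse=True)
    lines = [f"- {d.label} at {list(d.box)} (confidence {d.score:.2f})"
             for d in kept]
    context = "\n".join(lines) if lines else "- no confident detections"
    return f"Detected objects:\n{context}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    dets = [Detection("cat", (10, 20, 110, 220), 0.91),
            Detection("dog", (0, 0, 5, 5), 0.20)]
    print(build_perception_prompt("What animal is on the sofa?", dets))
```

Serializing boxes as text keeps the sketch model-agnostic; a real system would more likely inject learned perception features, which this example does not attempt.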

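A new framework named Mutually Reinforced Multimodal Large Language Model (MR-MLLM) combines visual perception and multimodal comprehension through a shared query fusion mechanism and a perception-enhanced cross-modal integration method, together with a prompt generation mechanism that embeds perceptual information, to provide more accurate multimodal interpretation. It shows superior performance on a range of multimodal comprehension and vision perception tasks.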

MR-MLLM: Mutual Enhancement of Multimodal Understanding and Visual Perception

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Developing visual perception models for active agents and sensorimotor
control is cumbersome in the physical world, as existing algorithms
are too slow to learn efficiently in real time and robots are fragile and
costly. This has given rise to learning in simulation, which in turn raises
the question of whether the results transfer to the real world. In this paper, we are
concerned with the problem of developing real-world perception for active
agents, propose Gibson Virtual Environment for this purpose, and showcase
sample perceptual tasks learned therein. Gibson is based on virtualizing real
spaces, rather than using artificially designed ones, and currently includes
over 1400 floor spaces from 572 full buildings. The main characteristics of
Gibson are: I. being from the real world and reflecting its semantic
complexity, II. having an internal synthesis mechanism, "Goggles", enabling
deployment of the trained models in the real world without further domain
adaptation, III. embodiment of agents, making them subject to the constraints of
physics and space.

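To address the difficulty of developing visual perception models and sensorimotor control in real environments, and the slowness of existing algorithms, this paper proposes the Gibson Virtual Environment, which is built by virtualizing real spaces and includes over 1400 floor spaces from 572 full buildings. It reflects the semantic complexity of the real world, has an internal synthesis mechanism, and embodies agents so that they are subject to the constraints of physics and space.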

Gibson Env: Providing Real-World Perception for Embodied Agents

Gibson Env: Real-World Perception for Embodied Agents

State-of-the-art visual perception models for a wide range of tasks rely on
supervised pretraining. ImageNet classification is the de facto pretraining
task for these models. Yet, ImageNet is now nearly ten years old and is by
modern standards "small". Even so, relatively little is known about the
behavior of pretraining with datasets that are multiple orders of magnitude
larger. The reasons are obvious: such datasets are difficult to collect and
annotate. In this paper, we present a unique study of transfer learning with
large convolutional networks trained to predict hashtags on billions of social
media images. Our experiments demonstrate that training for large-scale hashtag
prediction leads to excellent results. We show improvements on several image
classification and object detection tasks, and report the highest ImageNet-1k
single-crop, top-1 accuracy to date: 85.4% (97.6% top-5). We also perform
extensive experiments that provide novel empirical data on the relationship
between large-scale pretraining and transfer learning performance.
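
The top-1 and top-5 figures reported above are top-k accuracies: a prediction counts as correct when the true class is among the k highest-scoring classes. A minimal sketch of that metric follows. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 1) -> float:
    """Fraction of examples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k: List[int] = sorted(range(len(row)),
                                  key=lambda i: row[i],
                                  reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


if __name__ == "__main__":
    # Toy example: 3 images, 3 classes (values are made up).
    scores = [[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, k=1))
    print(top_k_accuracy(scores, labels, k=2))
```

Top-k accuracy never decreases as k grows, which is why the top-5 figure is higher than the top-1 figure.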

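This paper studies the behavior of transfer learning with convolutional networks trained to predict hashtags on large-scale social media images, and presents experimental results showing that large-scale pretraining significantly improves performance on image classification and object detection tasks.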