Visible-Infrared Person Re-identification (VI ReID) aims to match visible
and infrared images of the same pedestrians across non-overlapping camera views.
These two input modalities contain both invariant information, such as shape,
and modality-specific details, such as color. An ideal model should utilize
valuable information from both modalities during training for enhanced
representational capability. However, the gap caused by modality-specific
information makes it difficult for a VI ReID model to handle inputs from
both modalities simultaneously. To address this, we introduce the
Modality-aware and Instance-aware Visual Prompts (MIP) network,
designed to effectively utilize both invariant and specific information for
identification. Specifically, our MIP model is built on the transformer
architecture. In this model, we design a series of modality-specific
prompts that enable the model to adapt to and make use of the specific
information inherent in each modality's inputs, thereby reducing the
interference caused by the modality gap and achieving better identification.
In addition, we use each pedestrian's features to construct a group of
instance-specific prompts. These customized prompts guide the model to
adapt to each pedestrian instance dynamically, thereby capturing
identity-level discriminative clues for identification. Extensive
experiments on the SYSU-MM01 and RegDB datasets validate the effectiveness of
both designed modules. Additionally, our proposed MIP performs better
than most state-of-the-art methods.
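
The two prompt types described above can be illustrated with a minimal sketch. It is not the MIP implementation: the embedding width, the number of prompts, the fixed random values standing in for learned parameters, and the mean-pooling generator for instance prompts are all assumptions chosen to keep the example small. It only shows how a modality-selected prompt group and an input-dependent prompt group could be prepended to the patch tokens before they enter a transformer.

```python
import random

EMBED_DIM = 4     # toy token width (assumption)
NUM_PROMPTS = 2   # prompts per group (assumption)


def make_prompts(seed):
    """Fixed random vectors standing in for learnable prompt parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for _ in range(NUM_PROMPTS)]


# One prompt group per modality; the group is chosen by the input's modality.
MODALITY_PROMPTS = {"visible": make_prompts(0), "infrared": make_prompts(1)}


def instance_prompts(patch_tokens):
    """Hypothetical generator: derive prompts from the instance's own tokens.

    The tokens are mean-pooled and the pooled vector is scaled once per
    prompt slot. A real model would use a learned generation network.
    """
    n = len(patch_tokens)
    pooled = [sum(tok[d] for tok in patch_tokens) / n for d in range(EMBED_DIM)]
    return [[(k + 1) * v for v in pooled] for k in range(NUM_PROMPTS)]


def build_input_sequence(patch_tokens, modality):
    """Prepend modality prompts, then instance prompts, to the patch tokens."""
    return MODALITY_PROMPTS[modality] + instance_prompts(patch_tokens) + patch_tokens
```

In this sketch the modality prompts depend only on whether the image is visible or infrared, while the instance prompts change with every input image, which mirrors the division of roles described in the abstract.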

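The key to visible-infrared person re-identification is a modality-aware and instance-aware visual prompt network built on the Transformer architecture, which uses modality-specific and instance-specific prompts to improve discriminative ability; its effectiveness is validated on the SYSU-MM01 and RegDB datasets.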

Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning

The joint interpretation of multimodal spatiotemporal data has long been
an extremely challenging problem because of the high heterogeneity in
structure and semantics among the various modalities. The primary
challenge lies in striking a trade-off between the cohesion and autonomy of
diverse modalities, and this trade-off becomes increasingly nonlinear
as the number of modalities grows. We introduce the Language as
Reference Framework (LaRF), a fundamental principle for constructing a
multimodal unified model, aiming to strike a trade-off between the cohesion and
autonomy among different modalities. We propose a multimodal spatiotemporal
general artificial intelligence model, called AllSpark. Our model integrates
thirteen different modalities into a unified framework, including 1D (text,
code), 2D (RGB, infrared, SAR, multispectral, hyperspectral, tables, graphs,
trajectory, oblique photography), and 3D (point clouds, videos) modalities. To
achieve modal cohesion, AllSpark uniformly maps diverse modal features to the
language modality. In addition, we design modality-specific prompts to guide
multimodal large language models in accurately perceiving multimodal data. To
maintain modality autonomy, AllSpark introduces modality-specific encoders to
extract the tokens of various spatiotemporal modalities. A modal bridge is then
employed to achieve dimensional projection from each modality to the language
modality. Finally, observing a gap between the model's interpretation and
downstream tasks, we design task heads to enhance the model's generalization
capability on specific downstream tasks. Experiments indicate that AllSpark
achieves competitive accuracy in modalities such as RGB and trajectory compared
to state-of-the-art models.
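
The encoder, bridge, and prompt stages described above can be sketched as follows. This is an illustrative toy, not AllSpark's code: the two encoders, the token widths, the placeholder bridge weights, and the prompt strings are all hypothetical. It shows only the data flow, in which each modality keeps its own encoder (autonomy) and a per-modality linear bridge projects the resulting tokens to one shared language-space width (cohesion).

```python
LANG_DIM = 4  # width of the language model's token space (assumption)


def encode_rgb(pixels):
    """Toy RGB encoder: one 3-dim token per (r, g, b) pixel, scaled to [0, 1]."""
    return [[c / 255.0 for c in px] for px in pixels]


def encode_trajectory(points):
    """Toy trajectory encoder: one 2-dim token per (x, y) point."""
    return [[float(x), float(y)] for x, y in points]


def make_bridge(in_dim, out_dim):
    """Placeholder projection weights; a trained bridge would learn these."""
    return [[(i + j + 1) / (in_dim * out_dim) for j in range(out_dim)]
            for i in range(in_dim)]


def project(tokens, weight):
    """Linear projection of each token from its modality width to out_dim."""
    out_dim = len(weight[0])
    return [[sum(tok[i] * weight[i][j] for i in range(len(tok)))
             for j in range(out_dim)] for tok in tokens]


# Each modality has its own encoder and native token width.
ENCODERS = {"rgb": (encode_rgb, 3), "trajectory": (encode_trajectory, 2)}
# One bridge per modality, all mapping into the same language-space width.
BRIDGES = {name: make_bridge(dim, LANG_DIM) for name, (_, dim) in ENCODERS.items()}
# Hypothetical modality-specific text prompts for the language model.
PROMPTS = {"rgb": "The input is an RGB image.",
           "trajectory": "The input is a trajectory."}


def to_language_space(modality, raw):
    """Encode raw data, project its tokens, and pair them with a text prompt."""
    encoder, _ = ENCODERS[modality]
    tokens = encoder(raw)
    return PROMPTS[modality], project(tokens, BRIDGES[modality])
```

Under these assumptions, adding a further modality means registering one more encoder, bridge, and prompt, while the language-space width stays fixed. Task heads would consume the language model's output and are left out of the sketch.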

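By introducing the Language as Reference Framework (LaRF) and the AllSpark model, the joint interpretation of multimodal spatiotemporal data is framed as a trade-off between cohesion and autonomy among modalities; experiments show that AllSpark achieves competitive accuracy on modalities such as RGB and trajectory compared with state-of-the-art models.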