Visible-infrared person re-identification (VIReID) primarily deals with
matching identities across person images from different modalities. Due to the
modality gap between visible and infrared images, cross-modality identity
matching poses significant challenges. Recognizing that high-level semantics of
pedestrian appearance, such as gender, shape, and clothing style, remain
consistent across modalities, this paper intends to bridge the modality gap by
infusing visual features with high-level semantics. Given the capability of
CLIP to sense high-level semantic information corresponding to visual
representations, we explore the application of CLIP within the domain of
VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network
(CSDN) that consists of Modality-specific Prompt Learner, Semantic Information
Integration (SII), and High-level Semantic Embedding (HSE). Specifically,
considering the diversity stemming from modality discrepancies in language
descriptions, we devise bimodal learnable text tokens to capture
modality-private semantic information for visible and infrared images,
respectively. Additionally, acknowledging the complementary nature of semantic
details across different modalities, we integrate text features from the
bimodal language descriptions to achieve comprehensive semantics. Finally, we
establish a connection between the integrated text features and the visual
features across modalities. This process embeds rich high-level semantic
information into visual representations, thereby promoting the modality
invariance of visual representations. The effectiveness and superiority of our
proposed CSDN over existing methods have been substantiated through
experimental evaluations on multiple widely used benchmarks. The code will be
released at https://github.com/nengdong96/CSDN.
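
The pipeline described above (modality-specific text features, their integration, and their connection to visual features) can be illustrated with a small NumPy sketch. The abstract does not say how SII merges the two text features or which objective HSE uses, so the averaging step and the CLIP-style cross-entropy below are assumptions made for illustration only. The function names, the temperature value, and the toy features are hypothetical and are not taken from the released code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def integrate_text_features(text_vis, text_ir):
    """Assumed stand-in for SII: average the visible-prompt and
    infrared-prompt text features of each identity."""
    return l2_normalize(0.5 * (l2_normalize(text_vis) + l2_normalize(text_ir)))

def semantic_embedding_loss(visual, labels, text_integrated, temperature=0.07):
    """Assumed stand-in for HSE: cross-entropy over cosine similarities
    between each image feature and every identity's integrated text feature."""
    logits = l2_normalize(visual) @ text_integrated.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy text features for 3 identities in an 8-dimensional space (hypothetical).
text_vis = np.eye(3, 8)        # features from the visible-modality prompts
text_ir = np.eye(3, 8, k=3)    # features from the infrared-modality prompts
text_integrated = integrate_text_features(text_vis, text_ir)

labels = np.array([0, 1, 2, 0])
visual_aligned = text_integrated[labels]             # images matching their identity
visual_shuffled = text_integrated[(labels + 1) % 3]  # images matching a wrong identity

loss_aligned = semantic_embedding_loss(visual_aligned, labels, text_integrated)
loss_shuffled = semantic_embedding_loss(visual_shuffled, labels, text_integrated)
print(loss_aligned < loss_shuffled)
```

In this toy setting the loss is small when image features of either modality sit close to the integrated text feature of their identity, which is the sense in which such an objective would push visual representations toward modality invariance.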

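Visible-infrared person re-identification (VIReID) primarily deals with matching identities across person images from different modalities; this work bridges the modality gap by fusing high-level semantics with visual features. We propose a CLIP-driven Semantic Discovery Network (CSDN) that embeds rich high-level semantic information through bimodal learnable text tokens and integrated text features, thereby promoting the modality invariance of visual features. Experimental evaluations on multiple widely used benchmarks confirm the effectiveness and superiority of the proposed CSDN.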

CLIP-Based Semantic Discovery Network for Visible-Infrared Person Re-Identification

CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification

Most classification models treat different object classes in parallel and the
misclassifications between any two classes are treated equally. In contrast,
human beings can exploit high-level information in making a prediction of an
unknown object. Inspired by this observation, the paper proposes a super-class
guided network (SGNet) to integrate the high-level semantic information into
the network so as to increase its performance in inference. SGNet takes
two-level class annotations that contain both super-class and finer class
labels. The super-classes are higher-level semantic categories, each of which
groups several finer classes. A super-class branch (SCB), trained on
super-class labels, is introduced to guide finer class prediction. At
inference time, we adopt two different strategies: two-step inference (TSI) and
direct inference (DI). TSI first predicts the super-class and then makes
predictions of the corresponding finer class. On the other hand, DI directly
generates predictions from the finer class branch (FCB). Extensive experiments
have been performed on CIFAR-100 and MS COCO datasets. The experimental results
validate the proposed approach and demonstrate its superior performance on
image classification and object detection.
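
The two inference strategies can be sketched as follows. This is a minimal illustration, assuming the network exposes one logit vector from the super-class branch and one from the finer class branch, plus a lookup from each finer class to its super-class. The function names, the `fine_to_super` mapping, and the toy logits are hypothetical and are not taken from the paper.

```python
import numpy as np

def direct_inference(fine_logits):
    """DI: take the top-scoring finer class from the finer class branch."""
    return int(np.argmax(fine_logits))

def two_step_inference(super_logits, fine_logits, fine_to_super):
    """TSI: pick the super-class first, then the best finer class inside it."""
    fine_to_super = np.asarray(fine_to_super)
    predicted_super = int(np.argmax(super_logits))
    # Mask out finer classes that do not belong to the predicted super-class.
    masked = np.where(fine_to_super == predicted_super, fine_logits, -np.inf)
    return predicted_super, int(np.argmax(masked))

# Hypothetical toy setup: 2 super-classes, 4 finer classes.
fine_to_super = [0, 0, 1, 1]            # finer class index -> super-class index
super_logits = np.array([2.0, 0.5])     # super-class branch favours super-class 0
fine_logits = np.array([0.1, 0.7, 1.5, 0.2])

di_pred = direct_inference(fine_logits)
tsi_super, tsi_pred = two_step_inference(super_logits, fine_logits, fine_to_super)
print(di_pred, tsi_super, tsi_pred)  # 2 0 1
```

In this toy case the two strategies disagree: DI returns finer class 2, while TSI is constrained to super-class 0 and returns finer class 1. That shows how the super-class prediction can override a high finer-class score.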

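This paper proposes an image classification and object detection model based on a super-class guided network, which improves performance by introducing high-level semantic information. The model takes two-level class annotations that contain both super-class and finer class labels, and uses two different inference strategies to predict image classes. Experiments show that the method achieves superior performance on the CIFAR-100 and MS COCO datasets.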

SGNet: A Super-Class Guided Network for Image Classification and Object Detection

SGNet: A Super-class Guided Network for Image Classification and Object Detection

Predicting where people look in natural scenes has attracted a lot of
interest in computer vision and computational neuroscience over the past two
decades. Two seemingly contrasting categories of cues have been proposed to
influence where people look: \textit{low-level image saliency} and
\textit{high-level semantic information}. Our first contribution is to take a
detailed look at these cues to confirm the hypothesis proposed by
Henderson~\cite{henderson1993eye} and Nuthmann \&
Henderson~\cite{nuthmann2010object} that observers tend to look at the center
of objects. We analyzed fixation data from 17 observers free-viewing 60
fully annotated images with various types of objects. Images contained
different types of scenes, such as natural scenes, line drawings, and 3D
rendered scenes. Our second contribution is to propose a simple combined model
of low-level saliency and object center-bias that outperforms each individual
component significantly over our data, as well as on the OSIE dataset by Xu et
al.~\cite{xu2014predicting}. The results reconcile saliency with object
center-bias hypotheses and highlight that both types of cues are important in
guiding fixations. Our work opens new directions to understand strategies that
humans use in observing scenes and objects, and demonstrates the construction
of combined models of low-level saliency and high-level object-based
information.
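
A combined model of this kind can be illustrated with a short NumPy sketch. The abstract does not specify how the two components are combined, so the weighted sum, the Gaussian placed at each bounding-box centre, and all parameter values below are assumptions for illustration. The random map stands in for the output of any low-level saliency model.

```python
import numpy as np

def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    m = m - m.min()
    peak = m.max()
    return m / peak if peak > 0 else m

def object_center_map(shape, boxes, sigma_scale=0.25):
    """Sum one Gaussian per object, centred on its bounding-box centre.
    Boxes are (x0, y0, x1, y1) in pixels; sigma scales with box size."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros(shape, dtype=float)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sx = max(sigma_scale * (x1 - x0), 1.0)
        sy = max(sigma_scale * (y1 - y0), 1.0)
        out += np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
    return normalize(out)

def combine(saliency, center_map, weight=0.5):
    """Assumed combination rule: weighted sum of the two normalized maps."""
    return normalize(weight * normalize(saliency) + (1 - weight) * center_map)

# Hypothetical inputs: a random stand-in saliency map and one annotated object.
rng = np.random.default_rng(0)
saliency = rng.random((40, 60))
boxes = [(10, 5, 30, 25)]

center = object_center_map(saliency.shape, boxes)
combined = combine(saliency, center, weight=0.3)
peak_row, peak_col = np.unravel_index(np.argmax(center), center.shape)
print(peak_row, peak_col)  # 15 20, the centre of the bounding box
```

The `weight` parameter controls how much the prediction relies on low-level saliency compared with object centres. In practice it would be tuned against fixation data rather than fixed by hand.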

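This work studies where people look in natural scenes, for which two seemingly contrasting types of cues have been proposed: low-level image saliency and high-level semantic information. It analyzes the effect of object center-bias and proposes a model that combines low-level visual saliency with object center-bias, aiming to better understand the strategies humans use when observing scenes and objects and to demonstrate the construction of models that combine low-level visual saliency with high-level object information.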