Existing pedestrian attribute recognition (PAR) algorithms adopt a pre-trained
CNN (e.g., ResNet) as their backbone network for visual feature learning, which
may yield sub-optimal results because the relations between pedestrian images
and attribute labels are insufficiently exploited. In this paper, we
formulate PAR as a vision-language fusion problem and fully exploit the
relations between pedestrian images and attribute labels. Specifically, the
attribute phrases are first expanded into sentences, and then the pre-trained
vision-language model CLIP is adopted as our backbone for feature embedding of
visual images and attribute descriptions. The contrastive learning objective
connects the vision and language modalities well in the CLIP-based feature
space, and the Transformer layers used in CLIP can capture the long-range
relations between pixels. Then, a multi-modal Transformer is adopted to fuse
the dual features effectively, and a feed-forward network is used to predict
attributes. To optimize our network efficiently, we propose a region-aware
prompt tuning technique that adjusts very few parameters (i.e., only the prompt
vectors and classification heads) and keeps both the pre-trained VL model and
the multi-modal Transformer fixed. Our proposed PAR algorithm adjusts only
0.75% of the learnable parameters compared with the fine-tuning strategy. It
also achieves new state-of-the-art performance in both the standard and
zero-shot settings for PAR, including the RAPv1, RAPv2, WIDER, PA100K, PETA-ZS,
and RAP-ZS datasets. The
source code and pre-trained models will be released on
this https URL
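
As a rough illustration of the first step described above (expanding attribute phrases into sentences before text encoding), the following is a minimal sketch. The template wording and the helper name `expand_attributes` are assumptions made for illustration only; the abstract does not give the exact prompt used.

```python
# Hypothetical sentence template; the abstract does not state the exact wording.
TEMPLATE = "The pedestrian has the attribute: {}."


def expand_attributes(attributes, template=TEMPLATE):
    """Turn short attribute phrases into full sentences for a text encoder."""
    sentences = []
    for attr in attributes:
        # Normalize label-style names such as "Long_Hair" to plain words.
        phrase = attr.replace("_", " ").strip().lower()
        sentences.append(template.format(phrase))
    return sentences


sentences = expand_attributes(["Long_Hair", "backpack"])
for s in sentences:
    print(s)
```

Each resulting sentence would then be passed to the text encoder in place of the bare attribute label.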

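This paper formulates pedestrian attribute recognition (PAR) as a vision-language fusion problem that fully exploits the relations between pedestrian images and attribute labels. The pre-trained vision-language model CLIP serves as the backbone for feature embedding, with its contrastive learning objective and Transformer layers capturing long-range relations between pixels. A multi-modal Transformer then fuses the dual features, and a feed-forward network predicts the attributes. The algorithm achieves new state-of-the-art results in PAR.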

Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion

Existing pedestrian attribute recognition (PAR) algorithms are mainly
developed for static images. However, their performance is unreliable on
images with challenging factors such as heavy occlusion and motion blur. In
this work, we propose to understand human attributes using video frames that
can make full use of temporal information. Specifically, we formulate the
video-based PAR as a vision-language fusion problem and adopt the pre-trained
large model CLIP to extract feature embeddings of the given video frames. To
better utilize the semantic information, we take the attribute list as another
input and transform each attribute word or phrase into a corresponding sentence
via split, expand, and prompt operations. Then, the text encoder of CLIP is utilized for
language embedding. The averaged visual tokens and text tokens are concatenated
and fed into a fusion Transformer for multi-modal interactive learning. The
enhanced tokens are then fed into a classification head for pedestrian attribute
prediction. Extensive experiments on a large-scale video-based PAR dataset
validate the effectiveness of our proposed framework.
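
The token handling described above (averaging visual tokens over frames, then concatenating them with text tokens before the fusion Transformer) can be sketched in plain Python as follows. The function names, tensor shapes, and toy values are assumptions for illustration; the actual framework operates on CLIP embeddings.

```python
def average_frames(frame_tokens):
    """Average visual tokens over the temporal (frame) axis.

    frame_tokens: list over frames; each frame is a list of token vectors,
    and each token vector is a list of floats of equal length.
    """
    num_frames = len(frame_tokens)
    num_tokens = len(frame_tokens[0])
    dim = len(frame_tokens[0][0])
    return [
        [sum(frame[t][d] for frame in frame_tokens) / num_frames for d in range(dim)]
        for t in range(num_tokens)
    ]


def concat_tokens(visual_tokens, text_tokens):
    """Concatenate along the token axis to form the fusion Transformer input."""
    return visual_tokens + text_tokens


# Toy input: 2 frames, 2 visual tokens per frame, embedding dimension 2.
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
text = [[0.5, 0.5]]  # one text token with the same embedding dimension

visual = average_frames(frames)
fused_input = concat_tokens(visual, text)
print(len(fused_input), "tokens:", fused_input)
```

Averaging over frames keeps the sequence length independent of the number of frames, so the fusion Transformer sees a fixed number of visual tokens plus one group of text tokens.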

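This work proposes a pedestrian attribute recognition method based on video frames that fuses visual and language information, using the CLIP model for feature extraction and language embedding and predicting pedestrian attributes through multi-modal interactive learning.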

Learning CLIP Guided Visual-Text Fusion Transformer for Video-based Pedestrian Attribute Recognition

Learning to fuse and represent vision and language information is an
important research problem with many applications. Recent progress has
leveraged the ideas of pre-training (from language modeling) and attention
layers in Transformers to learn representations from datasets containing images
aligned with linguistic expressions that describe the images. In this paper, we
propose learning representations from a set of implied, visually grounded
expressions between image and text, automatically mined from those datasets. In
particular, we use denotation graphs to represent how specific concepts (such
as sentences describing images) can be linked to abstract and generic concepts
(such as short phrases) that are also visually grounded. Such
generic-to-specific relations can be discovered using linguistic analysis
tools. We propose methods to incorporate these relations into representation
learning. We show that state-of-the-art multimodal learning models can be
further improved by leveraging automatically harvested structural relations.
The representations lead to stronger empirical results on downstream tasks of
cross-modal image retrieval, referring expression, and compositional
attribute-object recognition. Both our code and the extracted denotation
graphs on the Flickr30K and the COCO datasets are publicly available at
this https URL
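
To make the generic-to-specific idea concrete, the following toy sketch links short generic phrases to the images whose captions contain them. It is only an assumption-laden stand-in: the paper mines these relations with linguistic analysis tools, whereas this example uses simple contiguous word matching, and the captions, phrase list, and function name are hypothetical.

```python
from collections import defaultdict


def build_toy_denotation_graph(captions, phrases):
    """Map each generic phrase to the sorted image ids whose caption contains it.

    captions: dict of image id -> specific sentence describing the image.
    phrases: list of short generic phrases (hypothetical, hand-picked here).
    Matching is by contiguous words, a toy substitute for linguistic analysis.
    """
    graph = defaultdict(set)
    for image_id, sentence in captions.items():
        words = sentence.lower().rstrip(".").split()
        for phrase in phrases:
            target = phrase.lower().split()
            n = len(target)
            if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
                graph[phrase].add(image_id)
    return {phrase: sorted(ids) for phrase, ids in graph.items()}


captions = {
    "img1": "A man rides a bike in the park.",
    "img2": "A woman rides a bike.",
    "img3": "A dog runs.",
}
graph = build_toy_denotation_graph(captions, ["rides a bike", "a man", "a dog"])
for phrase, image_ids in graph.items():
    print(phrase, "->", image_ids)
```

A generic phrase such as "rides a bike" ends up grounded in several images, while each full caption remains tied to a single image, which is the kind of structural relation the abstract describes.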

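This paper proposes learning representations from implied, visually grounded expressions, using structural relations automatically mined from image-text datasets to improve multimodal models for vision-language fusion, and demonstrates the method's effectiveness on cross-modal image retrieval, referring expression, and compositional attribute-object recognition.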