Text-guided image generation aims to generate desired images conditioned on
given texts, while text-guided image manipulation refers to semantically editing
parts of a given image based on specified texts. For these two similar tasks,
the key point is to ensure image fidelity as well as semantic consistency. Many
previous approaches require complex multi-stage generation and adversarial
training, while struggling to provide a unified framework for both tasks. In
this work, we propose TextCLIP, a unified framework for text-guided image
generation and manipulation without adversarial training. The proposed method
accepts either images or random noise as input, corresponding to the
manipulation and generation tasks respectively. Conditioned on the given text,
a carefully designed mapping network exploits the generative capabilities of
StyleGAN and the text-image representation capabilities of Contrastive
Language-Image Pre-training (CLIP) to generate images at resolutions of up to
$1024\times1024$. Extensive experiments on the Multi-modal CelebA-HQ dataset
demonstrate that our proposed method outperforms existing state-of-the-art
methods on both text-guided generation and manipulation tasks.
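
The abstract does not state TextCLIP's training objective. As a minimal sketch of how "semantic consistency" between an image and a text can be scored with CLIP-style embeddings, the example below assumes that both have already been encoded into a shared space and measures their cosine distance. The function names and the toy three-dimensional vectors are hypothetical and stand in for real CLIP encoder outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_consistency_loss(image_embedding, text_embedding):
    """Cosine distance: near 0 when the embeddings align, larger otherwise.

    Assumption: this is a commonly used CLIP-guided loss, not necessarily
    the objective used by TextCLIP.
    """
    return 1.0 - cosine_similarity(image_embedding, text_embedding)


# Hypothetical toy embeddings standing in for CLIP encoder outputs.
text_embedding = [0.6, 0.8, 0.0]
matching_image = [0.6, 0.8, 0.0]
unrelated_image = [0.0, 0.0, 1.0]

loss_match = semantic_consistency_loss(matching_image, text_embedding)
loss_unrelated = semantic_consistency_loss(unrelated_image, text_embedding)
```

Under this assumption, a text-conditioned generator would be trained to lower this loss for the image it produces, which requires no discriminator.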

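This paper proposes TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. By combining the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) with the generative capabilities of StyleGAN, it generates images at resolutions of up to 1024×1024 and outperforms existing state-of-the-art methods on the Multi-modal CelebA-HQ dataset.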

TextCLIP: Text-Guided Face Image Generation and Manipulation Without Adversarial Training

We introduce caption-guided face recognition (CGFR) as a new framework to
improve the performance of commercial-off-the-shelf (COTS) face recognition
(FR) systems. In contrast to combining soft biometrics (e.g., facial marks,
gender, and age) with face images, in this work, we use facial descriptions
provided by face examiners as a piece of auxiliary information. However, due to
the heterogeneity of the modalities, improving the performance by directly
fusing the textual and facial features is very challenging, as the two lie in
different embedding spaces. In this paper, we propose a contextual feature
aggregation module (CFAM) that addresses this issue by effectively exploiting
the fine-grained word-region interaction and global image-caption association.
Specifically, CFAM adopts self-attention and cross-attention schemes to
improve the intra-modality and inter-modality relationships between the image
and textual features, respectively. Additionally, we design a textual feature
refinement module (TFRM) that refines the textual features of the pre-trained
BERT encoder by updating the contextual embeddings. This module enhances the
discriminative power of textual features with a cross-modal projection loss and
realigns the word and caption embeddings with visual features by incorporating
a visual-semantic alignment loss. We implemented the proposed CGFR framework on
two face recognition models (ArcFace and AdaFace) and evaluated its performance
on the Multi-Modal CelebA-HQ dataset. Our framework significantly improves the
performance of ArcFace under both the 1:1 verification and 1:N identification
protocols.
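
The abstract names self-attention and cross-attention but does not give their exact form. The sketch below assumes standard scaled dot-product attention and shows how the same routine yields intra-modality attention (words attending to words) and inter-modality attention (words attending to image regions). The function names and the toy two-dimensional features are hypothetical; real CFAM features would come from the BERT and face-recognition encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


# Hypothetical toy features: 2 word vectors and 3 image-region vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

intra = attend(words, words, words)      # self-attention within the text
inter = attend(words, regions, regions)  # cross-attention: words over regions
```

In the cross-attention call each word receives a weighted mixture of region features, which is one way to realize the fine-grained word-region interaction the abstract describes.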

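This paper introduces a caption-guided face recognition (CGFR) framework that improves the performance of commercial off-the-shelf (COTS) face recognition (FR) systems by using facial descriptions as auxiliary information. A contextual feature aggregation module (CFAM) and a textual feature refinement module (TFRM) handle the heterogeneity between textual and facial features, significantly improving ArcFace's verification and identification performance on the Multi-Modal CelebA-HQ dataset.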

Improving Face Recognition from Caption Supervision with Multi-Granular Contextual Feature Aggregation

In this work, we propose TediGAN, a novel framework for multi-modal image
generation and manipulation with textual descriptions. The proposed method
consists of three components: a StyleGAN inversion module, visual-linguistic
similarity learning, and instance-level optimization. The inversion module maps
real images to the latent space of a well-trained StyleGAN. The
visual-linguistic similarity module learns text-image matching by mapping the
image and text into a common embedding space. The instance-level optimization
preserves identity during manipulation. Our model can produce diverse and
high-quality images at an unprecedented resolution of $1024\times1024$. Using a control
mechanism based on style-mixing, our TediGAN inherently supports image
synthesis with multi-modal inputs, such as sketches or semantic labels, with or
without instance guidance. To facilitate text-guided multi-modal synthesis, we
propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real
face images and their corresponding semantic segmentation maps, sketches, and
textual descriptions. Extensive experiments on the introduced dataset demonstrate the
superior performance of our proposed method. Code and data are available at
this https URL
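
The style-mixing control mechanism mentioned above can be illustrated with a minimal sketch. It assumes a layer-wise latent code (one style vector per generator layer) and swaps a chosen subset of layers from a guiding code, for example one derived from text, into a base code, for example one obtained by inverting a real image. The function name, the 18-layer count, the 4-dimensional toy vectors, and the choice of layers are all assumptions for illustration and are not taken from the TediGAN paper.

```python
def style_mix(base_code, guide_code, layers):
    """Return a copy of base_code with the given layer indices taken from guide_code."""
    if len(base_code) != len(guide_code):
        raise ValueError("both codes must have the same number of layers")
    chosen = set(layers)
    return [
        list(guide_code[i]) if i in chosen else list(base_code[i])
        for i in range(len(base_code))
    ]


NUM_LAYERS = 18  # assumed number of style inputs for a 1024x1024 StyleGAN
STYLE_DIM = 4    # toy dimension; real style vectors are much larger

# Hypothetical codes: all zeros for the inverted image, all ones for the text.
image_code = [[0.0] * STYLE_DIM for _ in range(NUM_LAYERS)]
text_code = [[1.0] * STYLE_DIM for _ in range(NUM_LAYERS)]

# Assumed choice: let the text control only the later layers.
mixed = style_mix(image_code, text_code, layers=range(8, NUM_LAYERS))
```

Feeding `mixed` to the generator would keep the layers inherited from the image code and let the guiding code drive the rest; which layers to swap for a given attribute is a design decision that this sketch leaves open.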

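This paper proposes TediGAN, a framework for multi-modal image generation and manipulation. The method comprises three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The model can take multi-modal inputs and generate diverse, high-quality images at a resolution of 1024, and the paper introduces the Multi-Modal CelebA-HQ dataset to support text-guided multi-modal synthesis. Extensive experiments demonstrate the method's superior performance.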