Deep neural networks used in computer vision have been shown to exhibit many social biases, such as gender bias. Vision Transformers (ViTs) have become increasingly popular in computer vision applications, outperforming Convolutional Neural Networks (CNNs) in many tasks such as image classification. However, because research on mitigating bias in computer vision has focused primarily on CNNs, it is important to evaluate how a different network architecture affects the potential for bias amplification. In this paper we therefore introduce a novel metric for measuring bias in architectures, Accuracy Difference. We examine bias amplification when models from these two architectures are used as part of large multimodal models, evaluating the different image encoders of Contrastive Language-Image Pretraining (CLIP), an important model used in many generative models such as DALL-E and Stable Diffusion. Our experiments demonstrate that architecture can play a role in amplifying social biases, owing to the different techniques the models employ for feature extraction and embedding as well as their different learning properties. We found that ViTs amplified gender bias to a greater extent than CNNs.

Biased Attention: Do Vision Transformers Amplify Gender Bias More than Convolutional Neural Networks?