Large multi-modal models (LMMs) hold the potential to usher in a new era of
automated visual assistance for people who are blind or low vision (BLV). Yet,
these models have not been systematically evaluated on data captured by BLV
users. We address this by empirically assessing CLIP, a widely used LMM likely
to underpin many assistive technologies. Testing 25 CLIP variants in a
zero-shot classification task, we find that their accuracy is 15 percentage
points lower on average for images captured by BLV users than for web-crawled
images. This disparity stems from CLIP's sensitivities to 1) image content
(e.g. not recognizing disability objects as well as other objects); 2) image
quality (e.g. not being robust to lighting variation); and 3) text content
(e.g. not recognizing objects described by tactile adjectives as well as visual
ones). We delve deeper with a textual analysis of three common pre-training
datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content
is rarely mentioned. We then provide three examples that illustrate how the
performance disparities extend to three downstream models underpinned by CLIP:
OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5
images can mitigate CLIP's quality-of-service disparities for BLV users in some
scenarios, which we discuss alongside a set of other possible mitigations.

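Large multi-modal models (LMMs) show promise for providing automated visual assistance to people who are blind or low vision (BLV). We empirically evaluate CLIP, testing 25 CLIP variants on a zero-shot classification task, and find that accuracy is on average 15 percentage points lower on images captured by BLV users, owing to CLIP's sensitivity to image content, image quality, and text content. A textual analysis of three common pre-training datasets shows that disability content is rarely mentioned. We also provide three examples illustrating how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can, in some scenarios, mitigate CLIP's quality-of-service disparities for BLV users, and we discuss a set of other possible mitigations.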
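
To make the evaluation setup concrete, the following is a minimal sketch of zero-shot classification and of an accuracy gap measured in percentage points. It is an illustration under stated assumptions, not the procedure used in this work: the two-dimensional embeddings, the label names, and the helper functions (`zero_shot_predict`, `accuracy`, `gap_in_percentage_points`) are all hypothetical placeholders, and no real CLIP model is loaded. In practice the vectors would come from CLIP's image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_predict(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    img = normalize(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, normalize(emb)))
        for label, emb in label_embs.items()
    }
    return max(scores, key=scores.get)


def accuracy(image_embs, true_labels, label_embs):
    """Fraction of images whose predicted label matches the true label."""
    correct = sum(
        zero_shot_predict(emb, label_embs) == label
        for emb, label in zip(image_embs, true_labels)
    )
    return correct / len(true_labels)


def gap_in_percentage_points(acc_a, acc_b):
    """Difference between two accuracies (given as fractions), in percentage points."""
    return 100.0 * (acc_a - acc_b)


# Toy placeholder embeddings (hypothetical, NOT results from any model or dataset).
label_embs = {"guide cane": [1.0, 0.0], "mug": [0.0, 1.0]}
true_labels = ["guide cane", "mug"]
web_images = [[0.9, 0.1], [0.2, 0.8]]  # stand-in for web-crawled images
blv_images = [[0.4, 0.6], [0.1, 0.9]]  # stand-in for BLV-captured images

web_acc = accuracy(web_images, true_labels, label_embs)
blv_acc = accuracy(blv_images, true_labels, label_embs)
gap = gap_in_percentage_points(web_acc, blv_acc)
```

The gap is reported in percentage points (an absolute difference between two accuracies) rather than as a relative percentage, which matches how the 15-point disparity is stated above.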