We seek to semantically describe a set of images, capturing both the attributes of single images and the variations within the set. Our procedure is analogous to Principal Component Analysis (PCA), with generated phrases taking the role of the projection vectors. First, we generate a centroid phrase that has the largest average semantic similarity to the images in the set; both the similarity computation and the generation are based on pretrained vision-language models. Then, using the same models, we generate the phrase whose similarity scores vary the most across the images in the set. The next phrase maximizes the variance subject to being orthogonal, in the latent space, to the highest-variance phrase, and the process continues. Our experiments show that our method is able to convincingly capture the essence of image sets and describe the individual elements in a semantically meaningful way within the context of the entire set. Our code is available at: https://github.com/OdedH/textual-pca.
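The procedure above can be illustrated with a minimal sketch. It is not the released implementation and it makes several assumptions: phrases are selected from a fixed candidate pool rather than generated, image and phrase embeddings are assumed to come from some shared vision-language latent space, similarity is taken to be cosine similarity, and the orthogonality constraint is approximated by projecting previously chosen directions out of the remaining candidates. The function name `textual_pca_sketch` and its parameters are hypothetical.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length (rows are assumed to be nonzero)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def textual_pca_sketch(image_emb, phrase_emb, n_components=2):
    """Select a centroid phrase and variance-maximizing phrases from a pool.

    Assumption: embeddings are precomputed and live in one shared latent
    space; selecting from a pool stands in for phrase generation.
    Returns (centroid_index, list_of_component_indices).
    """
    images = normalize(np.asarray(image_emb, dtype=float))
    phrases = normalize(np.asarray(phrase_emb, dtype=float))

    # Centroid phrase: largest average cosine similarity to the images.
    sims = phrases @ images.T  # shape (n_phrases, n_images)
    centroid = int(np.argmax(sims.mean(axis=1)))

    chosen = []
    residual = phrases.copy()
    for _ in range(n_components):
        norms = np.linalg.norm(residual, axis=1)
        valid = norms > 1e-8  # candidates with a usable residual direction
        if not valid.any():
            break
        directions = np.zeros_like(residual)
        directions[valid] = residual[valid] / norms[valid, None]

        # Variance, across the image set, of each candidate's scores.
        variances = (directions @ images.T).var(axis=1)
        variances[~valid] = -np.inf
        best = int(np.argmax(variances))
        chosen.append(best)

        # Remove the chosen direction so later picks are orthogonal to it.
        d = directions[best]
        residual = residual - np.outer(residual @ d, d)
    return centroid, chosen
```

In this sketch the second and later components are scored on their residual direction, so a phrase is rewarded only for variation that earlier phrases did not already explain. This is one way to read the orthogonality constraint, not necessarily the one used in the paper.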

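This study uses pretrained vision-language models to semantically describe a set of images by generating phrases, capturing both the attributes of individual images and the variation within the set as a whole. By computing and comparing the similarity scores of the generated phrases, it captures and explains the key features of the image set.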

Describing Sets of Images with Textual-PCA