Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding (recognizing objects, attributes, and actions) rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question, representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America but significantly lower performance for Africa. We also observe disparity in their performance across cultural facets, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.

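This study introduces CulturalVQA, a visual question-answering benchmark for assessing the geo-diverse cultural understanding of VLMs. Evaluating models such as GPT-4V and Gemini on CulturalVQA reveals that their level of cultural understanding differs across regions, with stronger cultural understanding for North America and lower performance for Africa. The study also observes performance differences across cultural facets, with clothing, rituals, and traditions faring better than food and drink. These disparities help us identify where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive dataset for evaluating the understanding of diverse cultures.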

Benchmarking Vision Language Models for Cultural Understanding