Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. As a result, artificial intelligence (AI) models often fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite substantial progress in generative vision models, synthetic images remain unreliable: they often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

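This work addresses the severe underrepresentation of Southeast Asia in vision-language research by introducing the SEA-VL dataset, which fills this gap with high-quality, culturally relevant data. Combining crowdsourcing, image crawling, and image generation, it finds that crawled images achieve good cultural relevance at lower cost, and it reveals the limitations of generated images in accurately reflecting Southeast Asian cultures. The dataset is intended to support visual research on Southeast Asian cultures and to advance the development of inclusive AI systems.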

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia