Understanding spatial relations is a crucial cognitive ability for both humans and AI. While current research has predominantly focused on benchmarking text-to-image (T2I) models, we propose a more comprehensive evaluation that includes \textit{both} T2I models and Large Language Models (LLMs). As spatial relations are naturally understood in a visuo-spatial manner, we develop an approach to convert LLM outputs into images, thereby allowing us to evaluate both T2I models and LLMs \textit{visually}. We examine the spatial relation understanding of 8 prominent generative models (3 T2I models and 5 LLMs) on a set of 10 common prepositions, and assess the feasibility of automatic evaluation methods. Surprisingly, we find that T2I models achieve only subpar performance despite their impressive general image-generation abilities. Even more surprisingly, our results show that LLMs are significantly more accurate than T2I models in generating spatial relations, despite being primarily trained on textual data. We examine the reasons for model failures and highlight gaps that can be filled to enable more spatially faithful generations.

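This study addresses the problem of evaluating how well generative models produce spatial relations, in particular the comparison between text-to-image (T2I) models and Large Language Models (LLMs). By converting LLM outputs into images, we propose a new evaluation method and find that LLMs significantly outperform T2I models in generating spatial relations, a finding that reveals potential shortcomings in current image-generation techniques and directions for improvement.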

Evaluating the Generation of Spatial Relations in Text and Image Generative Models