With the success of Large Language Models (LLMs), a surge of Generative
Vision-Language Models (GVLMs) has been constructed via multimodal instruction
tuning. This tuning recipe deviates substantially from common contrastive
vision-language learning. However, the performance of GVLMs in multimodal
compositional reasoning remains largely unexplored, as ex