Recent advances in vision-language pre-training (VLP) have demonstrated
impressive performance on a range of vision-language (VL) tasks. However,
several challenges remain in measuring the community's progress in building
general multi-modal intelligence. First, most downstream VL datasets are
annotated on raw images already seen during pre-training, which may lead to
overestimating the generalization ability of current VLP models.
Second, recent VLP work mainly focuses on absolute performance but overlooks
the efficiency-performance trade-off, which is also an important indicator for
measuring progress.
To this end, we introduce the Vision-Language Understanding Evaluation (VLUE)
benchmark, a multi-task multi-dimension benchmark for evaluating the
generalization capabilities and the efficiency-performance trade-off (``Pareto
SOTA'') of VLP models. We demonstrate that all VLP models exhibit a sizable
generalization gap when tested on out-of-distribution test sets annotated on
images from a more diverse distribution that spans cultures.
Moreover, we find that measuring the efficiency-performance trade-off of VLP
models yields complementary insights into several VLP design choices. We
release the VLUE benchmark to promote research on building vision-language
models that generalize well to more diverse images and concepts unseen during
pre-training and that offer a practical efficiency-performance trade-off.

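This work introduces VLUE, a vision-language understanding evaluation benchmark for assessing the generalization ability and the efficiency-performance trade-off of VLP models. The benchmark shows that all VLP models exhibit a sizable generalization gap on images that come from a wider range of cultures and were not seen during pre-training, and that measuring the efficiency-performance trade-off of VLP models offers useful insights for design choices.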