Vision Language Models (VLMs) excel in zero-shot image classification by
pairing images with textual category names. The expanding variety of
Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for
specific tasks. Thus, a promising zero-shot image classification strategy is
selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely
on the text data of the target dataset without access to the dataset's images.
In this paper, we analyze two inherent challenges in assessing the ability of a
VLM in this Language-Only VLM selection: the "Modality Gap" -- the disparity
between a VLM's embeddings across the two modalities, which makes text a less
reliable substitute for images; and the "Capability Gap" -- the discrepancy
between a VLM's overall ranking and its ranking on the target dataset, hindering direct
prediction of a model's dataset-specific performance from its general
performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the
negative impact of these two gaps. SWAB first adopts optimal transport to
capture the relevance between open-source datasets and the target dataset with a
transportation matrix. It then uses this matrix to transfer useful statistics
of VLMs from the open-source datasets to the target dataset, bridging the two
gaps and improving the estimate of each VLM's capability for VLM selection. Experiments
across various VLMs and image classification datasets validate SWAB's
effectiveness.
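
The transfer step described above can be illustrated with a minimal sketch. The abstract does not specify SWAB's cost function, the statistics it transfers, or its solver, so everything below is an assumption made for illustration: class-name text embeddings are compared by cosine distance, the transport matrix is computed with generic entropy-regularized Sinkhorn iterations, and the transferred statistic is a hypothetical per-class score. The names `sinkhorn` and `transfer_statistics` are not from the paper.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.5, n_iters=500):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Returns a transport plan whose rows sum to `a`; its columns
    approach `b` as the iterations converge.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

def transfer_statistics(src_emb, tgt_emb, src_stats, reg=0.5):
    """Estimate a per-class statistic on target classes from source classes.

    src_emb:   (n_src, d) text embeddings of open-source class names (assumed)
    tgt_emb:   (n_tgt, d) text embeddings of target class names (assumed)
    src_stats: (n_src,)   a known per-class statistic of one VLM (assumed)
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cost = 1.0 - src @ tgt.T         # cosine distance between class names
    a = np.full(len(src), 1.0 / len(src))
    b = np.full(len(tgt), 1.0 / len(tgt))
    plan = sinkhorn(cost, a, b, reg)
    # Normalize each column so every target class receives a convex
    # combination of the source statistics.
    weights = plan / plan.sum(axis=0, keepdims=True)
    return weights.T @ src_stats, plan

# Usage with random stand-in embeddings (illustrative only, not real data).
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 8))
tgt_emb = rng.normal(size=(3, 8))
src_stats = np.array([0.9, 0.5, 0.7, 0.6, 0.8])
est, plan = transfer_statistics(src_emb, tgt_emb, src_stats)
```

Because each target estimate is a weighted average of source values, it always stays within the range of the source statistics; semantically closer source classes simply receive more weight.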

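This paper analyzes two inherent challenges in language-only Vision Language Model (VLM) selection, the modality gap and the capability gap, and proposes a method called SWAB to mitigate both. SWAB uses optimal transport to capture the relevance between open-source datasets and the target dataset, then transfers useful statistics from the open-source datasets to the target dataset, improving the estimate of each VLM's capability for selection. Experiments on multiple VLMs and image classification datasets validate SWAB's effectiveness.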

Bridge the Modality and Capacity Gaps in Vision-Language Model Selection

Large Language Models (LLMs) have marked a significant advancement in the
field of natural language processing, demonstrating exceptional capabilities in
reasoning, tool usage, and memory. As their applications extend into
multi-agent environments, a need has arisen for a comprehensive evaluation
framework that captures their abilities in reasoning, planning, collaboration,
and more. This work introduces a novel benchmarking framework specifically
tailored to assess LLMs within multi-agent settings, providing quantitative
metrics to evaluate their judgment, reasoning, deception, self-awareness,
collaboration, coordination, and rationality. We utilize games such as
Chameleon and Undercover, alongside game theory scenarios like Cost Sharing,
Multi-player Prisoner's Dilemma, and Public Good, to create diverse testing
environments. Our framework is fortified with the Probabilistic Graphical
Modeling (PGM) method, enhancing the LLMs' capabilities in navigating complex
social and cognitive dimensions. The benchmark evaluates seven multi-agent
systems powered by different LLMs, quantitatively highlighting a capability gap
of more than threefold between the strongest, GPT-4, and the weakest,
Llama-2-70B. It also confirms that our PGM enhancement boosts the inherent
abilities of all selected models by 50% on average. Our code is released at
this https URL
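
To make one of the game theory scenarios concrete, the sketch below computes payoffs for a textbook Public Good game. The abstract does not give the benchmark's exact rules, so the endowment, the multiplier, and the equal split of the pot are assumptions, and `public_goods_payoffs` is a hypothetical helper rather than part of the released code.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Payoffs in a standard public goods game (assumed rules).

    Each player keeps whatever they do not contribute; the pooled
    contributions are multiplied and split equally among all players.
    """
    n = len(contributions)
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]

# Two players contribute everything, two contribute nothing.
payoffs = public_goods_payoffs([10, 10, 0, 0])
```

With these assumed values the two contributors each end with 10.0 and the two free riders with 20.0. This gap between individual and collective incentives is the kind of tension such a scenario would use to probe an agent's collaboration and rationality.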

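This study introduces a benchmarking framework designed specifically to evaluate large language models in multi-agent environments. It builds diverse test environments from games and game theory scenarios, and uses a Probabilistic Graphical Modeling method to enhance the models' ability to navigate them. It quantitatively evaluates seven different large language models, finding a threefold capability gap between the strongest model, GPT-4, and the weakest, Llama-2-70B, and confirms that the Probabilistic Graphical Modeling enhancement improves all models, by 50% on average.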