Large Vision and Language Models have enabled significant advances in fully
supervised and zero-shot vision tasks. These large pre-trained architectures
serve as the baseline for what is currently known as Instruction Tuning Large
Vision and Language models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal
assistants whose responses are modulated by natural language instructions and
arbitrary visual data. Despite this versatility, IT-LVLM effectiveness in
fundamental computer vision problems remains unclear, primarily due to the
absence of a standardized evaluation benchmark. This paper introduces a
Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess
the performance of IT-LVLMs on fundamental computer vision tasks. MERLIM
contains over 279K image-question pairs, and has a strong focus on detecting
cross-modal "hallucination" events in IT-LVLMs, where the language output
refers to visual concepts that lack any effective grounding in the image. Our
results show that state-of-the-art IT-LVLMs are still limited in identifying
fine-grained visual concepts, that object hallucinations are common across
tasks, and that their results are strongly biased by small variations in the
input query, even when the queries have the very same semantics. Our findings
also suggest that these models have weak visual grounding, but they can still
make adequate guesses from global visual patterns or textual biases contained
in the LLM component.

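This paper introduces MERLIM, a multi-modal evaluation benchmark for assessing IT-LVLMs on fundamental computer vision tasks. It finds that state-of-the-art IT-LVLMs are still limited in identifying fine-grained visual concepts, that object hallucinations are common across tasks, and that results are strongly biased by small variations in the input query, even when the queries share the same semantics. The findings also suggest that these models have weak visual grounding but can still make adequate guesses from global visual patterns or textual biases in the LLM component.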

Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

Recent advances have substantially improved the accuracy, memory cost, and
training speed of differentially private (DP) deep learning, especially on
large vision and language models with millions to billions of parameters. In
this work, we thoroughly study the per-sample gradient clipping style, a key
component in DP optimization. We show that different clipping styles have the
same time complexity but instantiate an accuracy-memory trade-off: while
all-layer clipping (of coarse granularity) is the most prevalent and usually
gives the best accuracy, it incurs a heavier memory cost than other group-wise
clipping styles, such as layer-wise clipping (of finer granularity). We
formalize this trade-off through our convergence theory and complexity
analysis. Importantly, we demonstrate that the accuracy gap between group-wise
clipping and all-layer clipping becomes smaller for larger models, while the
memory advantage of group-wise clipping remains. Consequently, group-wise
clipping allows DP optimization of large models to achieve high accuracy and
low peak memory simultaneously.

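In this work, we thoroughly study the per-sample gradient clipping style, a key component of differentially private optimization. We find that different clipping styles have the same time complexity but exhibit an accuracy-memory trade-off: coarse-grained all-layer clipping usually gives the best accuracy but incurs a higher memory cost than finer-grained group-wise clipping. We formalize this trade-off through convergence theory and complexity analysis. Importantly, we show that the accuracy gap between group-wise and all-layer clipping shrinks for larger models, while the memory advantage of group-wise clipping remains. Group-wise clipping therefore allows DP optimization of large models to achieve high accuracy and low peak memory simultaneously.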