We present LoCoVQA, a dynamic benchmark generator for evaluating long-context
extractive reasoning in vision language models (VLMs). LoCoVQA augments test
examples for mathematical reasoning, VQA, and character recognition tasks with
increasingly long visual contexts composed of both in-distribution and
out-of-distribution distractor images.
Across these tasks, a diverse set of VLMs rapidly lose performance as the
visual context length grows, often exhibiting a striking exponential decay
trend. This test assesses how well VLMs can ignore irrelevant information when
answering queries -- a task that is quite easy for language models (LMs) in the
text domain -- demonstrating that current state-of-the-art VLMs lack this
essential capability for many long-context applications.
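
As a rough illustration of what an "exponential decay trend" means here, the sketch below fits acc(n) ≈ a · exp(-b · n) to accuracy measured at several visual context lengths n, using least squares on the log of the accuracy. The function name `fit_exponential_decay` and the accuracy values are hypothetical and synthetic, chosen only for illustration; they are not results or code from the paper.

```python
import math


def fit_exponential_decay(context_lengths, accuracies):
    """Fit acc(n) ~= a * exp(-b * n) by linear least squares on log(acc).

    Hypothetical helper for illustration; assumes every accuracy is > 0.
    """
    xs = list(context_lengths)
    ys = [math.log(acc) for acc in accuracies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(acc) against n
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), -slope     # (a, b)


# Synthetic accuracies generated from a known curve (NOT measured data).
lengths = [1, 4, 9, 16]
synthetic_acc = [0.8 * math.exp(-0.1 * n) for n in lengths]
a, b = fit_exponential_decay(lengths, synthetic_acc)
```

A larger fitted `b` would indicate faster loss of accuracy as distractor images are added; a model that filters distractors perfectly would give `b` close to zero.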

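LoCoVQA is a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). The test assesses how well VLMs can ignore irrelevant information when answering queries, and shows that current state-of-the-art VLMs lack this key capability for many long-context applications.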

Visual Needles Are Easily Lost in Image Haystacks

Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

Video anomaly detection (VAD) holds immense importance across diverse domains
such as surveillance, healthcare, and environmental monitoring. While numerous
surveys focus on conventional VAD methods, they often lack depth in exploring
specific approaches and emerging trends. This survey explores deep
learning-based VAD, expanding beyond traditional supervised training paradigms
to encompass emerging weakly supervised, self-supervised, and unsupervised
approaches. A prominent feature of this review is the investigation of core
challenges within the VAD paradigms, including large-scale datasets, feature
extraction, learning methods, loss functions, regularization, and anomaly score
prediction. Moreover, this review investigates vision language models
(VLMs) as potent feature extractors for VAD. VLMs integrate visual data with
textual descriptions or spoken language from videos, enabling a nuanced
understanding of scenes crucial for anomaly detection. By addressing these
challenges and proposing future research directions, this review aims to foster
the development of robust and efficient VAD systems leveraging the capabilities
of VLMs for enhanced anomaly detection in complex real-world scenarios. This
comprehensive analysis seeks to bridge existing knowledge gaps, provide
researchers with valuable insights, and contribute to shaping the future of VAD
research.

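This survey of deep learning-based video anomaly detection (VAD) explores emerging weakly supervised, self-supervised, and unsupervised approaches beyond the traditional supervised training paradigm, examines core challenges within the VAD paradigms, and investigates vision language models (VLMs) as powerful feature extractors for VAD. It aims to improve the robustness and efficiency of anomaly detection in complex real-world scenarios, bridge existing knowledge gaps, give researchers valuable insights, and contribute to the future of VAD research.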

Ten Years of Video Anomaly Detection: A Survey and Outlook

Video Anomaly Detection in 10 Years: A Survey and Outlook

Vision language models (VLMs) have drastically changed the computer vision
model landscape in only a few years, opening an exciting array of new
applications from zero-shot image classification to image captioning and
visual question answering. Unlike pure vision models, they offer an intuitive
way to access visual content through language prompting. The wide applicability
of such models encourages us to ask whether they also align with human vision:
specifically, how far they adopt human-induced visual biases through multimodal
fusion, or whether they simply inherit biases from pure vision models. One
important visual bias is the texture vs. shape bias, or the dominance of local
over global information. In this paper, we study this bias in a wide range of
popular VLMs. Interestingly, we find that VLMs are often more shape-biased than
their vision encoders, indicating that visual biases are modulated to some
extent through text in multimodal models. If text does indeed influence visual
biases, this suggests that we may be able to steer visual biases not just
through visual input but also through language: a hypothesis that we confirm
through extensive experiments. For instance, we are able to steer shape bias
from as low as 49% to as high as 72% through prompting alone. For now, the
strong human bias towards shape (96%) remains out of reach for all tested VLMs.
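
For readers unfamiliar with how a shape-bias percentage can be computed, here is a minimal sketch. It assumes the commonly used cue-conflict protocol, in which each image has a shape class and a conflicting texture class, and shape bias is the fraction of shape decisions among all predictions that match either cue. The abstract does not spell out its exact protocol, so this definition, the function name `shape_bias`, and the toy labels are assumptions for illustration only.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among predictions matching shape or texture.

    Assumed cue-conflict definition; predictions that match neither cue
    are ignored, as are images whose two cues agree.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue  # not a cue-conflict image
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no prediction matched either the shape or texture label")
    return shape_hits / decided


# Toy labels, made up for illustration (not data from the paper).
preds = ["cat", "elephant", "cat", "car", "dog"]
shapes = ["cat", "cat", "cat", "car", "bear"]
textures = ["elephant", "elephant", "dog", "clock", "bottle"]
bias = shape_bias(preds, shapes, textures)  # 3 shape hits, 1 texture hit
```

Under this definition, a value of 0.5 means shape and texture decisions are equally frequent, and steering through prompting would show up as a shift in this ratio for the same set of images.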

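A study of multimodal models finds that vision language models (VLMs) are more shape-biased than pure vision models, and that their shape bias can be steered through language prompting.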