Video understanding is a crucial next step for multimodal large language
models (MLLMs). To probe specific aspects of video understanding ability,
existing video benchmarks typically require careful video selection based on
the target capability, along with laborious annotation of query-response pairs
to match the specific video content. This process is both challenging and
resource-intensive. In this paper, we propose VideoNIAH (Video Needle In A
Haystack), a benchmark construction framework based on synthetic video
generation. VideoNIAH decouples test video content from its query-response
pairs by inserting unrelated image/text 'needles' into original videos. It
generates annotations solely from these needles, ensuring diversity in video
sources and a variety of query-response pairs. Additionally, by inserting multiple needles,
VideoNIAH rigorously evaluates the temporal understanding capabilities of
models. We used VideoNIAH to compile a video benchmark, VNBench, which includes
tasks such as retrieval, ordering, and counting. VNBench can efficiently
evaluate the fine-grained understanding ability and spatio-temporal modeling
ability of a video model, while also supporting long-context evaluation.
Additionally, we evaluated recent video-centric MLLMs, both open-source and
proprietary, and provide a comprehensive analysis.
We found that although proprietary models have significant advantages
over open-source models, all existing video models still perform poorly on
long-distance dependency tasks. VideoNIAH is a simple yet highly scalable
benchmark construction framework, and we believe it will inspire future video
benchmark works. The code and data are available at
this https URL
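
The needle idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Needle`, `place_needles`, and `annotate` are hypothetical, needles are reduced to text labels at frame indices, and the three answer formats are assumptions chosen to mirror the retrieval, ordering, and counting tasks. The point it shows is that every annotation is computed from the needles alone, so the haystack video can be swapped freely.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Needle:
    frame_index: int  # position in the haystack video where the needle is inserted
    label: str        # what the needle shows, e.g. the name of an inserted image


def place_needles(num_frames: int, labels: List[str], seed: int = 0) -> List[Needle]:
    """Assign each needle label a distinct, randomly chosen frame index."""
    rng = random.Random(seed)
    positions = rng.sample(range(num_frames), len(labels))
    return [Needle(pos, label) for pos, label in zip(positions, labels)]


def annotate(needles: List[Needle]) -> Dict[str, object]:
    """Derive answers from the needles alone; the haystack content is never read."""
    in_time_order = sorted(needles, key=lambda n: n.frame_index)
    return {
        "retrieval": sorted({n.label for n in needles}),   # which needles appear
        "ordering": [n.label for n in in_time_order],      # temporal order of needles
        "counting": len(needles),                          # how many needles appear
    }


# Hypothetical example: a 300-frame haystack with three text needles.
needles = place_needles(num_frames=300, labels=["apple", "bridge", "clock"], seed=7)
annotations = annotate(needles)
```

Because `annotate` never looks at the video frames, the same function works for any source video and any number of needles, which is the decoupling the abstract refers to.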

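VideoNIAH is a simple yet highly scalable benchmark construction framework. Through synthetic video generation, it decouples test video content from query-response pairs and generates annotations from multiple inserted, unrelated image/text 'needles', ensuring diverse video sources and varied query-response pairs.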

A scalable synthetic framework for benchmarking video multimodal language models

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

Predicting which specific parts of a video users will replay is important for
several applications, including targeted advertisement placement on video
platforms and assisting video creators. In this work, we explore whether it is
possible to predict the Most Replayed (MR) data from YouTube videos. To this
end, we curate a large video benchmark, the YTMR500 dataset, which comprises
500 YouTube videos with MR data annotations. We evaluate Deep Learning (DL)
models of varying complexity on our dataset and perform an extensive ablation
study. In addition, we conduct a user study to estimate the human performance
on MR data prediction. Our results show that, although by a narrow margin, all
the evaluated DL models outperform random predictions. Additionally, they
exceed human-level accuracy. This suggests that predicting the MR data is a
difficult task that can be enhanced through the assistance of DL. Finally, we
believe that DL performance on MR data prediction can be further improved, for
example, by using multi-modal learning. We encourage the research community to
use our benchmark dataset to further investigate automatic MR data prediction.

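Deep learning models are used to predict the Most Replayed (MR) data of YouTube videos. Evaluating several models on the YTMR500 dataset shows that this is a difficult task, yet all models outperform random prediction and exceed human-level accuracy. We encourage the research community to use our benchmark dataset to further investigate automatic MR data prediction.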

Predicting the Most Replayed data on video streaming platforms

Can we predict the Most Replayed data of video streaming platforms?

Driver attention prediction is becoming a focus of the safe driving research
community, as shown by the DR(eye)VE project and the recently released Berkeley
DeepDrive Attention (BDD-A) database, which covers critical situations. In safe
driving, an essential task is to predict upcoming accidents as early as
possible. BDD-A recognized this problem and, because such scenes are rare,
collected driver attention in the laboratory. Nevertheless, BDD-A focuses on
critical situations that do not involve actual accidents, and addresses only
the driver attention prediction task, without taking the further step toward
accident prediction. In contrast, we explore the view of drivers' eyes for
capturing multiple kinds of accidents, and construct a video benchmark, named
DADA-2000, that is more diverse and larger than previous ones and provides
driver attention and driving accident annotations simultaneously. DADA-2000 has
2000 video clips comprising about 658,476 frames and covering 54 kinds of
accidents. These clips are crowd-sourced and captured in various scenes
(highway, urban, rural, and tunnel), weather conditions (sunny, rainy, and
snowy), and light conditions (daytime and nighttime). For the driver attention
representation, we collect maps of fixations, saccade scan paths, and focusing
time. The accidents are annotated with their categories, the accident window in
each clip, and the spatial locations of the crash objects. Based on the
analysis, we obtain a quantitative and positive answer to the question posed in
this paper.
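
As a rough illustration of the accident annotations listed above (category, accident window, and crash-object locations), one clip's record could be modeled as follows. This is only a sketch: the class name `AccidentAnnotation`, the field names, the bounding-box layout, and the frame-based window are all assumptions, not the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed box layout: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class AccidentAnnotation:
    category: int            # index of the accident kind (54 kinds; indexing assumed)
    window: Tuple[int, int]  # first and last frame of the accident, inclusive
    crash_objects: Dict[int, Box] = field(default_factory=dict)  # frame -> box

    def in_window(self, frame: int) -> bool:
        """True if the frame lies inside the annotated accident window."""
        start, end = self.window
        return start <= frame <= end

    def frames_until_accident(self, frame: int) -> int:
        """Frames left before the window opens; 0 at or after its start."""
        return max(0, self.window[0] - frame)


# Hypothetical record for one clip.
ann = AccidentAnnotation(category=3, window=(120, 180),
                         crash_objects={120: (10, 20, 50, 40)})
```

A helper such as `frames_until_accident` reflects the abstract's emphasis on predicting accidents as early as possible: the earlier the frame, the larger the lead time before the window begins.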

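This paper presents a new video benchmark (DADA-2000) built on driver eye tracking and accident annotations. It covers 54 kinds of accidents and supports more comprehensive prediction of upcoming accidents.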