Recent advancements in Large Language Models (LLMs) have expanded their
capabilities to multimodal contexts, including comprehensive video
understanding. However, processing extensive videos such as 24-hour CCTV
footage or full-length films presents significant challenges due to the vast
data and processing demands. Traditional methods, like extracting key frames or
converting frames to text, often result in substantial information loss. To
address these shortcomings, we develop OmAgent, which efficiently stores and
retrieves relevant video frames for specific queries, preserving the detailed
content of videos. Additionally, it features a Divide-and-Conquer Loop capable
of autonomous reasoning, dynamically invoking APIs and tools to enhance query
processing and accuracy. This approach ensures robust video understanding,
significantly reducing information loss. Experimental results affirm OmAgent's
efficacy in handling various types of videos and complex tasks. Moreover, we
have endowed it with greater autonomy and a robust tool-calling system,
enabling it to accomplish even more intricate tasks.
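
The abstract does not specify how the Divide-and-Conquer Loop is implemented. As a rough illustration of the general pattern only (solve a task directly with a tool when it is simple enough, otherwise split it, recurse, and merge), a minimal Python sketch might look like the following. The `Task` class, the `max_items` threshold, and the toy tool registry are all assumptions made for this example and are not taken from OmAgent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical unit of work: a tool name plus the items it should process."""
    tool: str
    items: List[int]


def divide_and_conquer(task: Task,
                       tools: Dict[str, Callable[[List[int]], int]],
                       max_items: int = 2) -> int:
    """Solve `task` directly if it is small enough, otherwise split and recurse."""
    # Conquer: the task is simple enough to hand to a tool directly.
    if len(task.items) <= max_items:
        return tools[task.tool](task.items)
    # Divide: split into two subtasks and solve each one recursively.
    mid = len(task.items) // 2
    left = divide_and_conquer(Task(task.tool, task.items[:mid]), tools, max_items)
    right = divide_and_conquer(Task(task.tool, task.items[mid:]), tools, max_items)
    # Merge: combine the two partial results with the same tool.
    return tools[task.tool]([left, right])


# Toy tool registry (assumption: real tools would be API or model calls).
TOOLS = {"sum": sum, "max": max}

result_sum = divide_and_conquer(Task("sum", [1, 2, 3, 4, 5]), TOOLS)
result_max = divide_and_conquer(Task("max", [3, 9, 2, 7]), TOOLS)
```

The merge step here reuses the same tool, which only works for operations such as sum or max whose partial results can be combined that way; an agent working on video queries would presumably need a task-specific merge.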

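OmAgent is a system that efficiently stores and retrieves video frames in multimodal settings. By dynamically invoking APIs and tools to improve query processing and accuracy, it provides robust video understanding and significantly reduces information loss.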

OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

The proliferation of Large Language Models like ChatGPT has significantly
advanced language understanding and generation, impacting a broad spectrum of
applications. However, these models predominantly excel in text-based tasks,
overlooking the complexity of real-world multimodal information. This study
introduces MultiAPI, a pioneering comprehensive large-scale API benchmark
dataset aimed at expanding LLMs' proficiency in multimodal contexts. Developed
collaboratively through ChatGPT, MultiAPI consists of 235 diverse API calls and
2,038 contextual prompts, offering a unique platform for the evaluation of
tool-augmented LLMs handling multimodal tasks. Through comprehensive
experiments, our findings reveal that while LLMs demonstrate proficiency in API
call decision-making, they face challenges in domain identification, function
selection, and argument generation. Moreover, we find, surprisingly, that
auxiliary context can actually impair performance. An in-depth error
analysis paves the way for a new paradigm to address these challenges,
suggesting a potential direction for future LLM research.
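
The abstract names four stages at which a tool-augmented LLM can succeed or fail: API call decision-making, domain identification, function selection, and argument generation. The benchmark's actual metrics are not given here, so the following Python sketch is only one plausible way to score the stages separately, using exact match. The record fields (`needs_api`, `domain`, `function`, `arguments`) and the toy data are hypothetical.

```python
from typing import Dict, List, Optional

# The four stages named in the abstract; the field names are hypothetical.
STAGES = ["needs_api", "domain", "function", "arguments"]


def stage_accuracy(predictions: List[Dict],
                   references: List[Dict]) -> Dict[str, Optional[float]]:
    """Per-stage exact-match accuracy.

    Stages after the call decision are scored only on prompts whose
    reference requires an API call, since otherwise there is nothing to select.
    """
    correct = {s: 0 for s in STAGES}
    total = {s: 0 for s in STAGES}
    for pred, ref in zip(predictions, references):
        for stage in STAGES:
            if stage != "needs_api" and not ref["needs_api"]:
                continue
            total[stage] += 1
            correct[stage] += int(pred.get(stage) == ref.get(stage))
    return {s: (correct[s] / total[s] if total[s] else None) for s in STAGES}


# Toy data (assumption: invented for illustration, not drawn from MultiAPI).
refs = [
    {"needs_api": True, "domain": "image", "function": "caption",
     "arguments": {"image": "a.png"}},
    {"needs_api": True, "domain": "audio", "function": "transcribe",
     "arguments": {"audio": "b.wav"}},
    {"needs_api": False},
]
preds = [
    {"needs_api": True, "domain": "image", "function": "caption",
     "arguments": {"image": "a.png"}},
    {"needs_api": True, "domain": "image", "function": "classify",
     "arguments": {"image": "b.wav"}},
    {"needs_api": False},
]

scores = stage_accuracy(preds, refs)
```

In this toy run the model always decides correctly whether an API is needed but gets the later stages right only half the time, which mirrors the qualitative pattern the abstract reports.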

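Using MultiAPI, a dataset developed collaboratively with ChatGPT, this study evaluates large language models on multimodal tasks. It finds that the models are proficient at API call decision-making but face challenges in domain identification, function selection, and argument generation, and its error analysis suggests a direction for future LLM research.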

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark

Audio-visual question answering (AVQA) is a challenging task that requires
multistep spatio-temporal reasoning over multimodal contexts. To achieve scene
understanding ability similar to humans, the AVQA task presents specific
challenges, including effectively fusing audio and visual information and
capturing question-relevant audio-visual features while maintaining temporal
synchronization. This paper proposes a Target-aware Joint Spatio-Temporal
Grounding Network for AVQA to address these challenges. The proposed approach
has two main components: the Target-aware Spatial Grounding module, and the
Joint audio-visual temporal grounding module with its corresponding Tri-modal
consistency loss. The Target-aware module enables the model to focus on
audio-visual cues relevant to the inquiry subject by exploiting the explicit
semantics of text modality. The Tri-modal consistency loss facilitates the
interaction between audio and video during question-aware temporal grounding
and incorporates fusion within a simpler single-stream architecture.
Experimental results on the MUSIC-AVQA dataset demonstrate the effectiveness
and superiority of the proposed method over existing state-of-the-art methods.
Our code will be available soon.

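This study proposes a target-aware joint spatio-temporal grounding network for audio-visual question answering (AVQA). A tri-modal consistency loss enables question-aware spatio-temporal grounding and strengthens audio-visual interaction, with fusion carried out inside a single-stream architecture. Experiments on the MUSIC-AVQA dataset demonstrate the method's effectiveness and superiority.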