Long Video Question Answering (LVQA) is challenging due to the need for temporal reasoning and large-scale multimodal data processing. Existing methods struggle with retrieving cross-modal information from long videos, especially when relevant details are sparsely distributed. We introduce UMaT (Unified Multi-modal as Text), a retrieval-augmented generation (RAG) framework that efficiently processes extremely long videos while maintaining cross-modal coherence. UMaT converts visual and auditory data into a unified textual representation, ensuring semantic and temporal alignment. Short video clips are analyzed using a vision-language model, while automatic speech recognition (ASR) transcribes dialogue. These text-based representations are structured into temporally aligned segments, with adaptive filtering to remove redundancy and retain salient details. The processed data is embedded into a vector database, enabling precise retrieval of dispersed yet relevant content. Experiments on a benchmark LVQA dataset show that UMaT outperforms existing methods in multimodal integration, long-form video understanding, and sparse information retrieval. Its scalability and interpretability allow it to process videos over an hour long while maintaining semantic and temporal coherence. These findings underscore the importance of structured retrieval and multimodal synchronization for advancing LVQA and long-form AI systems.
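The pipeline described above (text from both modalities, temporal alignment, redundancy filtering, embedding, retrieval) can be illustrated with a minimal sketch. It is not the UMaT implementation. Everything in it is an assumption made for illustration: the captions and transcript lines are hard-coded where a vision-language model and an ASR system would supply them, the fixed 10-second window, the Jaccard-overlap filter, and the bag-of-words "embedding" stand in for whatever alignment, adaptive filtering, and vector database the authors use, and all names (`Segment`, `align`, `drop_redundant`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def align(captions, transcript, window=10.0):
    """Group (timestamp, text) items from both modalities into time windows."""
    buckets = {}
    for source, items in (("visual", captions), ("speech", transcript)):
        for start, text in items:
            idx = int(start // window)
            buckets.setdefault(idx, []).append(f"[{source}] {text}")
    return [
        Segment(idx * window, (idx + 1) * window, " ".join(parts))
        for idx, parts in sorted(buckets.items())
    ]


def jaccard(a, b):
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def drop_redundant(segments, threshold=0.8):
    """Fold a segment into the previous kept one when their text barely differs."""
    kept = []
    for seg in segments:
        if kept and jaccard(kept[-1].text, seg.text) >= threshold:
            # Extend the time span of the kept segment instead of repeating text.
            kept[-1] = Segment(kept[-1].start, seg.end, kept[-1].text)
        else:
            kept.append(seg)
    return kept


def embed(text):
    # Toy stand-in for a learned text embedding: a bag-of-words count vector.
    return Counter(tokenize(text))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, segments, k=1):
    """Return the k segments most similar to the question."""
    q = embed(question)
    return sorted(segments, key=lambda s: cosine(q, embed(s.text)), reverse=True)[:k]


# Hand-written stand-ins for vision-language-model captions and ASR output.
captions = [
    (0.0, "a man opens a red door"),
    (10.0, "a man opens a red door"),
    (3600.0, "a woman hands over a silver key"),
]
transcript = [
    (2.0, "hello is anyone home"),
    (12.0, "hello is anyone home"),
    (3605.0, "take this key and keep it safe"),
]

segments = drop_redundant(align(captions, transcript))
top = retrieve("who hands over the key", segments)
print(top[0].start, top[0].text)
```

In this toy input the two near-identical opening windows collapse into one segment, and the question retrieves the segment an hour into the video. Because each segment keeps its timestamps and readable text, the retrieved evidence can be inspected directly, which is the kind of interpretability the abstract refers to.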

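This work addresses the difficulty of cross-modal information retrieval in long video question answering (LVQA), particularly when the relevant information is sparsely distributed. The proposed UMaT framework converts visual and auditory data into a unified textual representation and uses temporal alignment and adaptive filtering to improve the relevance and accuracy of the retained information. Experiments show that UMaT outperforms existing methods in multimodal integration and sparse information retrieval, and that it is scalable and interpretable, which helps advance LVQA and long-form AI systems.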

Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment