Recent Large Language Models have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language
content. However, the learning methods of these large multimodal models
typically treat videos as predetermined clips, making them less effective and
efficient at handling streaming video inputs. In this paper, we propose a novel
Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned,
long-context, and real-time conversation within a continuous video stream. Our
LIVE framework comprises comprehensive approaches to achieve video streaming
dialogue, encompassing: (1) a training objective designed to perform language
modeling for continuous streaming inputs, (2) a data generation scheme that
converts offline temporal annotations into a streaming dialogue format, and (3)
an optimized inference pipeline to speed up the model responses in real-world
video streams. With our LIVE framework, we build the VideoLLM-online model upon
Llama-2/Llama-3 and demonstrate its significant advantages in processing
streaming videos. For instance, on average, our model can support streaming
dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. It also
achieves state-of-the-art performance on public offline video benchmarks for
tasks such as recognition, captioning, and forecasting. The code, model, data,
and demo have been made publicly available.
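
As a rough illustration of point (2), the sketch below turns offline temporal annotations into a per-frame streaming target, where most frames carry no response and an annotation is emitted once the stream reaches its start time. This is only one plausible reading of the abstract, not the authors' implementation: the names `Annotation`, `StreamStep`, and `to_streaming_dialogue`, the sampling rate, and the "emit at the first frame at or after the start time" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical offline annotation: a time span with a text label."""
    start: float  # seconds
    end: float    # seconds (kept for completeness, unused in this sketch)
    text: str


@dataclass
class StreamStep:
    """One sampled frame; response is None when the model should stay silent."""
    time: float
    response: Optional[str]


def to_streaming_dialogue(
    annotations: List[Annotation], duration: float, fps: float = 2.0
) -> List[StreamStep]:
    """Assumed conversion rule: emit each annotation exactly once, at the
    first sampled frame whose timestamp is at or after its start time."""
    ordered = sorted(annotations, key=lambda a: a.start)
    n_frames = int(duration * fps)
    steps: List[StreamStep] = []
    next_idx = 0
    for i in range(n_frames):
        t = i / fps
        response = None
        # At most one response per frame; a second annotation that is
        # already due waits for the following frame.
        if next_idx < len(ordered) and ordered[next_idx].start <= t:
            response = ordered[next_idx].text
            next_idx += 1
        steps.append(StreamStep(time=t, response=response))
    return steps


annotations = [
    Annotation(1.0, 2.0, "pick up cup"),
    Annotation(2.4, 3.0, "pour water"),
]
steps = to_streaming_dialogue(annotations, duration=4.0, fps=2.0)
for step in steps:
    print(f"{step.time:4.1f}s  {step.response or '(silent)'}")
```

Under these assumptions most frames are labeled silent, which is the property a streaming training objective would need to handle. Annotations that start after the last sampled frame are simply dropped in this sketch.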

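A novel framework that learns within the video stream to give large language models vision capabilities and real-time dialogue. To handle streaming video input, it combines a streaming-dialogue training objective, a data generation scheme, and an optimized inference pipeline.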

VideoLLM-online: Online Video Large Language Model for Streaming Video

Multimodal large language models (MLLMs) have shown remarkable capabilities
across a broad range of tasks, but their knowledge and abilities in the
geographic and geospatial domains are yet to be explored, despite potential
wide-ranging benefits to navigation, environmental research, urban development,
and disaster response. We conduct a series of experiments exploring various
vision capabilities of MLLMs within these domains, particularly focusing on the
frontier model GPT-4V, and benchmark its performance against open-source
counterparts. Our methodology involves challenging these models with a
small-scale geographic benchmark consisting of a suite of visual tasks, testing
their abilities across a spectrum of complexity. The analysis uncovers not only
where such models excel, including instances where they outperform humans, but
also where they falter, providing a balanced view of their capabilities in the
geographic domain. To enable the comparison and evaluation of future models,
our benchmark will be publicly released.

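Through a series of experiments, we study the knowledge and abilities of multimodal large language models in the geographic and geospatial domains, focusing on the vision capabilities of the frontier model GPT-4V and comparing its performance with open-source models. Our method challenges these models with a small-scale benchmark of geographic tasks, testing their abilities across tasks of varying difficulty. The analysis reveals where these models excel, including cases where they outperform humans, as well as where they fall short, providing a comprehensive view of their capabilities in the geographic domain. To facilitate the comparison and evaluation of future models, we will publicly release our benchmark.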

Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

In the rapidly evolving landscape of human-computer interaction, the
integration of vision capabilities into conversational agents stands as a
crucial advancement. This paper presents an initial implementation of a
dialogue manager that leverages the latest progress in Large Language Models
(e.g., GPT-4, IDEFICS) to enhance traditional text-based prompts with
real-time visual input. LLMs are used to interpret both textual prompts and
visual stimuli, creating a more contextually aware conversational agent. The
system's prompt engineering, which combines the dialogue with summaries of the
images, balances context preservation against computational efficiency. Six
interactions with a Furhat robot powered by this system are reported and
discussed. By implementing
this vision-enabled dialogue system, the paper envisions a future where
conversational agents seamlessly blend textual and visual modalities, enabling
richer, more context-aware dialogues.
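
The idea of combining dialogue with image summaries can be sketched as a small prompt builder that stores short text summaries of what the camera saw instead of raw images, and keeps only the most recent entries. This is a minimal sketch under stated assumptions, not the system described in the paper: the class `PromptBuilder`, its methods, the `scene` speaker tag, and the fixed-size window are hypothetical, and no real LLM or robot API is called.

```python
from collections import deque
from typing import Deque, Tuple


class PromptBuilder:
    """Hypothetical rolling prompt: dialogue turns plus image summaries."""

    def __init__(self, system_prompt: str, max_entries: int = 4) -> None:
        self.system_prompt = system_prompt
        # A bounded deque drops the oldest entry automatically, which caps
        # prompt length (the efficiency side of the trade-off).
        self.entries: Deque[Tuple[str, str]] = deque(maxlen=max_entries)

    def add_turn(self, speaker: str, text: str) -> None:
        """Record one dialogue turn verbatim."""
        self.entries.append((speaker, text))

    def add_image_summary(self, summary: str) -> None:
        """Record a text summary of an image rather than the image itself."""
        self.entries.append(("scene", summary))

    def build(self) -> str:
        """Assemble the system prompt followed by the retained entries."""
        lines = [self.system_prompt]
        lines.extend(f"[{speaker}] {text}" for speaker, text in self.entries)
        return "\n".join(lines)


builder = PromptBuilder("You are a helpful robot.", max_entries=2)
builder.add_turn("user", "hello")
builder.add_image_summary("A person waves at the camera.")
builder.add_turn("user", "What do you see?")
prompt = builder.build()
print(prompt)
```

With `max_entries=2` the oldest turn falls out of the window while the latest image summary and question remain, which is one simple way to trade context for prompt size.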

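This paper presents an initial implementation of a dialogue manager that uses the latest large language models (e.g., GPT-4, IDEFICS) to integrate vision capabilities into conversational agents, augmenting traditional text-based prompts with real-time visual input. The system's prompt engineering combines the dialogue with summaries of the images to balance context preservation and computational efficiency. By implementing this vision-enabled dialogue system, the paper envisions a future in which conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.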