We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation
of text-to-video models (T2VMs) via dense and precise captions. The series
comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with
various lengths and sources, developed through carefully designed data
filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and
capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic
videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that
reached SOTA performance on three advancing video benchmarks. To achieve this,
setting aside costly, non-scalable human annotation, we find that using GPT4V to
caption videos with a naive multi-frame or frame-concatenation input strategy
leads to less detailed and sometimes temporally confused results. We argue the
challenge of designing a high-quality video captioning strategy lies in three
aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame
detailed content description. 3) Frame-number scalability for arbitrary-length
videos. To this end, we meticulously designed a differential video captioning
strategy, which is stable, scalable, and efficient for generating captions for
videos with arbitrary resolution, aspect ratios, and length. Based on it, we
construct ShareGPT4Video, which contains 40K high-quality videos spanning a
wide range of categories, and the resulting captions encompass rich world
knowledge, object attributes, camera movements, and crucially, detailed and
precise temporal descriptions of events. Based on ShareGPT4Video, we further
develop ShareCaptioner-Video, a superior captioner capable of efficiently
generating high-quality captions for arbitrary videos...
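
The abstract does not spell out the differential captioning strategy, so the following is only a minimal sketch of one plausible reading: the first keyframe is described on its own, and each later keyframe is described in terms of what changed since the previous one, so every call sees at most two frames regardless of video length. The names `differential_captions`, `merge_captions`, and `stub_describe` are hypothetical, frames are stood in for by plain strings, and the captioner is a stub rather than a real GPT4V call.

```python
from typing import Callable, List, Optional

# Hypothetical captioner signature (an assumption, not an API from the paper):
# (previous_frame, previous_caption, current_frame) -> caption text.
Captioner = Callable[[Optional[str], Optional[str], str], str]


def differential_captions(frames: List[str], describe: Captioner) -> List[str]:
    """Caption each keyframe relative to the one before it.

    The first frame gets a standalone description (no previous frame is
    passed); every later frame is captioned together with the previous frame
    and its caption, so the input per call stays constant as the video grows.
    """
    captions: List[str] = []
    prev_frame: Optional[str] = None
    prev_caption: Optional[str] = None
    for frame in frames:
        caption = describe(prev_frame, prev_caption, frame)
        captions.append(caption)
        prev_frame, prev_caption = frame, caption
    return captions


def merge_captions(captions: List[str]) -> str:
    """Assumed final step: combine per-frame captions in temporal order.

    A real pipeline would likely summarize with a language model; this
    placeholder simply joins the captions with their frame indices.
    """
    return " ".join(f"[{i}] {c}" for i, c in enumerate(captions))


def stub_describe(prev_frame: Optional[str],
                  prev_caption: Optional[str],
                  frame: str) -> str:
    """Stand-in captioner used only to make the sketch runnable."""
    if prev_frame is None:
        return f"scene: {frame}"
    return f"change: {prev_frame} -> {frame}"
```

Calling `differential_captions(["a", "b", "c"], stub_describe)` yields one standalone caption followed by two change descriptions, which `merge_captions` then joins in order. Swapping `stub_describe` for a real multimodal model call is left open, since the abstract gives no prompt or API details.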

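Through dense and precise captions, we present the ShareGPT4Video series to advance video understanding in large video-language models (LVLMs) and video generation in text-to-video models (T2VMs). The series comprises 40K GPT4V-annotated dense captions of videos of various lengths and sources, developed through carefully designed data filtering and annotation strategies, together with ShareCaptioner-Video, an efficient captioning model for arbitrary videos, and ShareGPT4Video-8B, a superb LVLM.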

ShareGPT4Video: Enhancing Video Understanding and Generation by Improving Captions

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Recent advances in large video-language models have displayed promising
outcomes in video comprehension. Current approaches straightforwardly convert
video into language tokens and employ large language models for multi-modal
tasks. However, this method often leads to the generation of irrelevant
content, commonly known as "hallucination", as the length of the text increases
and the impact of the video diminishes. To address this problem, we propose
Vista-LLaMA, a novel framework that maintains a consistent distance between
all visual tokens and any language tokens, irrespective of the generated text
length. Vista-LLaMA omits relative position encoding when determining attention
weights between visual and text tokens, while retaining the position encoding
between text tokens. This amplifies the effect of visual tokens on text
generation, especially when the relative distance is longer between visual and
text tokens. The proposed attention mechanism significantly reduces the chance
of producing text irrelevant to the video content. Furthermore, we
present a sequential visual projector that projects the current video frame
into tokens of language space with the assistance of the previous frame. This
approach not only captures the temporal relationship within the video, but also
allows fewer visual tokens to encompass the entire video. Our approach
significantly outperforms various previous methods (e.g., Video-ChatGPT,
MovieChat) on four challenging open-ended video question answering benchmarks.
We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot
MSRVTT-QA, setting a new state-of-the-art performance. This project is
available at this https URL
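
To make the attention idea above concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes rotary position encoding as the relative position scheme and assumes that any query-key pair involving a visual token skips the rotation, so the score from a text token to a visual token does not depend on how far apart they are. The function names (`rotate`, `score`, `attention_weights`) and the `is_visual` flag list are hypothetical.

```python
import math
from typing import List


def rotate(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Apply rotary position encoding to an even-length vector at position pos."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score(q: List[float], k: List[float], q_pos: int, k_pos: int,
          q_visual: bool, k_visual: bool) -> float:
    """Unnormalized attention score for one query-key pair.

    Assumption of this sketch: if either token is visual, positions are
    ignored; text-text pairs keep the rotary (relative) encoding.
    """
    if q_visual or k_visual:
        return dot(q, k)
    return dot(rotate(q, q_pos), rotate(k, k_pos))


def attention_weights(queries: List[List[float]],
                      keys: List[List[float]],
                      is_visual: List[bool]) -> List[List[float]]:
    """Causal softmax attention weights over a mixed visual/text sequence."""
    n = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights: List[List[float]] = []
    for i in range(n):
        # Token i may only attend to positions 0..i (causal mask).
        scores = [
            score(queries[i], keys[j], i, j, is_visual[i], is_visual[j]) * scale
            for j in range(i + 1)
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return weights
```

In this sketch, a text query's raw score against a visual key is the same whether the text token sits at position 3 or position 50, while its score against another text token still varies with their relative distance. How visual-to-visual pairs are treated is not stated in the abstract, so skipping the rotation for them here is purely an assumption.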

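Current methods for video question answering are prone to generating irrelevant text when the generated text is long. This paper proposes the Vista-LLaMA framework, which adopts a new attention mechanism that keeps a consistent distance between visual and text tokens. This strengthens the influence of visual tokens on text generation, especially when the relative distance is long, and significantly reduces the probability of generating irrelevant text. In addition, a sequential visual projector is introduced to handle temporal relationships in the video, and the method shows superior performance on four challenging video question answering benchmarks.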