We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation
of text-to-video models (T2VMs) via dense and precise captions. The series
comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with
various lengths and sources, developed through carefully designed data
filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and
capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic
videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that
reached SOTA performance on three advanced video benchmarks. To achieve this,
setting aside costly, non-scalable human annotation, we find that using GPT4V to
caption videos with a naive multi-frame or frame-concatenation input strategy
leads to less detailed and sometimes temporally confused results. We argue the
challenge of designing a high-quality video captioning strategy lies in three
aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame
detailed content description. 3) Frame-number scalability for arbitrary-length
videos. To this end, we meticulously designed a differential video captioning
strategy, which is stable, scalable, and efficient for generating captions for
videos with arbitrary resolutions, aspect ratios, and lengths. Based on it, we
construct ShareGPT4Video, which contains 40K high-quality videos spanning a
wide range of categories, and the resulting captions encompass rich world
knowledge, object attributes, camera movements, and crucially, detailed and
precise temporal descriptions of events. Based on ShareGPT4Video, we further
develop ShareCaptioner-Video, a superior captioner capable of efficiently
generating high-quality captions for arbitrary videos...
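
As a rough illustration of how a differential strategy could address the three aspects listed above, the sketch below captions the first frame on its own and then describes only what changed between each adjacent pair of frames. The abstract does not spell out the procedure, so the pairwise structure, the names `differential_captions` and `toy_describe`, and the captioner signature are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical captioner signature (an assumption, not an API from the paper):
# (previous_frame, current_frame, previous_caption) -> caption text
Describe = Callable[[Optional[str], str, str], str]


def differential_captions(frames: Sequence[str], describe: Describe) -> List[str]:
    """Caption the first frame alone, then caption each later frame
    relative to the frame and caption that came just before it."""
    if not frames:
        return []
    captions = [describe(None, frames[0], "")]
    for prev, cur in zip(frames, frames[1:]):
        # Each call sees only two frames, so the per-step input size
        # stays constant no matter how many frames the video has.
        captions.append(describe(prev, cur, captions[-1]))
    return captions


def toy_describe(prev: Optional[str], cur: str, prev_caption: str) -> str:
    """Stand-in for a real vision-language model call."""
    if prev is None:
        return f"scene shows {cur}"
    return f"{prev} changes to {cur}"


frames = ["a red ball", "a rolling ball", "a still ball"]
captions = differential_captions(frames, toy_describe)
```

In this toy setup the frames are plain strings; with a real model, `describe` would take images, and the per-frame captions would still need to be merged into one video-level caption.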

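Using dense and precise captions, we present the ShareGPT4Video series to advance video understanding in large video-language models (LVLMs) and video generation in text-to-video models (T2VMs). The series comprises 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotation strategies, together with ShareCaptioner-Video, an efficient captioning model for arbitrary videos, and ShareGPT4Video-8B, a superb LVLM.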

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Existing text-to-image diffusion models struggle to synthesize realistic
images given dense captions, where each text prompt provides a detailed
description for a specific image region. To address this, we propose
DenseDiffusion, a training-free method that adapts a pre-trained text-to-image
model to handle such dense captions while offering control over the scene
layout. We first analyze the relationship between generated images' layouts and
the pre-trained model's intermediate attention maps. Next, we develop an
attention modulation method that guides objects to appear in specific regions
according to layout guidance. Without requiring additional fine-tuning or
datasets, we improve image generation performance given dense captions
on both automatic and human evaluation scores. In addition, we achieve
visual results of similar quality to those of models specifically trained
with layout conditions.
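
The idea of guiding objects into regions through attention can be sketched in a few lines. The abstract does not give the exact form of the modulation, so the additive bias, the `strength` parameter and the function name `modulated_attention` below are assumptions chosen for illustration. The sketch raises a token's attention logit at pixels inside its target region and lowers it elsewhere, then normalises over tokens.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(
    scores: List[List[float]],
    region_masks: List[List[int]],
    strength: float = 2.0,
) -> List[List[float]]:
    """scores[p][t] is the raw attention logit between pixel p and token t.
    region_masks[t][p] is 1 where token t should appear, else 0.
    Returns attention weights per pixel, normalised over tokens."""
    out = []
    for p, row in enumerate(scores):
        biased = [
            s + strength if region_masks[t][p] else s - strength
            for t, s in enumerate(row)
        ]
        out.append(softmax(biased))
    return out


# Toy layout: 4 pixels in a row and 2 tokens.
# Token 0 is assigned the left half, token 1 the right half.
scores = [[0.0, 0.0] for _ in range(4)]
region_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
attn = modulated_attention(scores, region_masks)
```

With uniform raw scores, each pixel ends up attending mostly to the token whose region contains it. A pre-trained model would supply non-uniform scores, and the bias would shift them toward the requested layout rather than replace them.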

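With DenseDiffusion, we effectively improve image generation from dense captions without additional fine-tuning or datasets, and achieve visual results similar to those of models specifically trained with layout conditions.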

Dense Text-to-Image Generation with Attention Modulation

Text-to-Image (T2I) ReID has attracted a lot of attention in the recent past.
CUHK-PEDES, RSTPReid and ICFG-PEDES are the three available benchmarks to
evaluate T2I ReID methods. RSTPReid and ICFG-PEDES comprise identities from
MSMT17, but due to the limited number of unique persons, their diversity is
limited. On the other hand, CUHK-PEDES comprises 13,003 identities but has
relatively shorter text descriptions on average. Further, these datasets are
captured in a restricted environment with a limited number of cameras. In order
to further diversify the identities and provide dense captions, we propose a
novel dataset called IIITD-20K. IIITD-20K comprises 20,000 unique identities
captured in the wild and provides a rich dataset for text-to-image ReID. With a
minimum of 26 words per description, each image is densely captioned. We further
synthetically generate images and fine-grained captions using Stable Diffusion
and BLIP models trained on our dataset. We perform extensive experiments using
state-of-the-art text-to-image ReID models and vision-language pre-trained models
and present a comprehensive analysis of the dataset. Our experiments also
reveal that synthetically generated data leads to a substantial performance
improvement in both same-dataset and cross-dataset settings. Our dataset
is available at this https URL

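We propose a new dataset named IIITD-20K, comprising dense captions for 20,000 unique identities captured in the wild. We further diversify the identities with synthetically generated images and fine-grained captions, and conduct experiments comparing against current state-of-the-art text-to-image ReID models.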