Deep video models, for example, 3D CNNs or video transformers, have achieved
promising performance on sparse video tasks, i.e., predicting one result per
video. However, challenges arise when adapting existing deep video models to
dense video tasks, i.e., predicting one result per frame. Specifically, these
models are expensive to deploy, less effective when handling redundant
frames, and poor at capturing long-range temporal correlations. To overcome
these issues, we propose a Temporal Dilated Video Transformer (TDViT) that
consists of carefully designed temporal dilated transformer blocks (TDTBs). A TDTB
can efficiently extract spatiotemporal representations and effectively
alleviate the negative effect of temporal redundancy. Furthermore, by using
hierarchical TDTBs, our approach obtains an exponentially expanded temporal
receptive field and therefore can model long-range dynamics. Extensive
experiments are conducted on two different dense video benchmarks, i.e.,
ImageNet VID for video object detection and YouTube VIS for video instance
segmentation. Excellent experimental results demonstrate the superior
efficiency, effectiveness, and compatibility of our method. The code is
available at this https URL
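
The claim about an exponentially expanded temporal receptive field can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stage attends to the current frame and to one memory frame `d` steps in the past, and that `d` doubles from stage to stage. The function names and the doubling rule are hypothetical and are not taken from the TDViT code.

```python
def doubling_dilations(num_stages, base=1):
    """Hypothetical dilation schedule: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(num_stages)]


def reachable_offsets(dilations):
    """Past-frame offsets reachable from the current frame when every stage
    sees the current frame (offset 0) and one frame `d` steps back."""
    offsets = {0}
    for d in dilations:
        # Each stage can either stay on a reached frame or step d frames back.
        offsets |= {o + d for o in offsets}
    return sorted(offsets)


dilated = reachable_offsets(doubling_dilations(4))  # dilations 1, 2, 4, 8
undilated = reachable_offsets([1, 1, 1, 1])         # no dilation growth

print(len(dilated), len(undilated))
```

Under these assumptions, four stages with doubling dilations cover 16 consecutive frames, while four stages with a fixed dilation of 1 cover only 5, which is the sense in which the receptive field grows exponentially rather than linearly with depth.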

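We propose a Temporal Dilated Video Transformer (TDViT) that uses hierarchical temporal dilated transformer blocks (TDTBs) to extract spatiotemporal representations and alleviate the negative effect of temporal redundancy, and can thereby model long-range dynamics. Extensive experiments on two dense video benchmarks, ImageNet VID for video object detection and YouTube VIS for video instance segmentation, demonstrate the efficiency, effectiveness, and compatibility of our method.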

TDViT: Temporal Dilated Video Transformer for Dense Video Tasks

Existing deep video models are limited by specific tasks, fixed input-output
spaces, and poor generalization capabilities, making it difficult to deploy
them in real-world scenarios. In this paper, we present our vision for
multimodal and versatile video understanding and propose a prototype system,
ChatVideo. Our system is built upon a tracklet-centric paradigm, which treats
tracklets as the basic video unit and employs various Video Foundation Models
(ViFMs) to annotate their properties, e.g., appearance and motion. All the
detected tracklets are stored in a database and interact with the user through
a database manager. We have conducted extensive case studies on different types
of in-the-wild videos, which demonstrate the effectiveness of our method in
answering various video-related questions. Our project is available at
this https URL
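
The tracklet-centric storage idea can be sketched as follows. This is a minimal illustration that assumes an in-memory SQLite table; the schema, the column names, the helper functions, and the toy rows are all hypothetical and are not taken from the ChatVideo implementation. In the described system the property values would come from Video Foundation Models rather than being written by hand.

```python
import sqlite3


def build_tracklet_db(rows):
    """Create an in-memory table of tracklets (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tracklets ("
        "id INTEGER PRIMARY KEY, category TEXT, start_frame INTEGER, "
        "end_frame INTEGER, appearance TEXT, motion TEXT)"
    )
    conn.executemany(
        "INSERT INTO tracklets "
        "(category, start_frame, end_frame, appearance, motion) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def find_tracklets(conn, category, motion):
    """Return (id, start_frame, end_frame) for matching tracklets."""
    cur = conn.execute(
        "SELECT id, start_frame, end_frame FROM tracklets "
        "WHERE category = ? AND motion = ? ORDER BY start_frame",
        (category, motion),
    )
    return cur.fetchall()


# Toy rows for illustration only.
rows = [
    ("person", 0, 120, "red jacket", "walking"),
    ("dog", 30, 90, "brown fur", "running"),
    ("person", 200, 260, "blue shirt", "running"),
]
conn = build_tracklet_db(rows)
hits = find_tracklets(conn, "person", "running")
print(hits)
```

A question such as "when is a person running?" then reduces to a structured query over annotated tracklets, which is roughly the role the abstract assigns to the database manager.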

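This paper presents a tracklet-centric prototype system for multimodal video understanding. It uses various Video Foundation Models (ViFMs) to annotate the properties of tracklets, stores the tracklets in a database, and interacts with the user through a database manager to answer various video-related questions.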

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Training competitive deep video models is an order of magnitude slower than
training their counterpart image models. Slow training causes long research
cycles, which hinders progress in video understanding research. Following
standard practice for training image models, video model training assumes a
fixed mini-batch shape: a specific number of clips, frames, and spatial size.
However, what is the optimal shape? High-resolution models perform well, but
train slowly. Low-resolution models train faster, but they are inaccurate.
Inspired by multigrid methods in numerical optimization, we propose to use
variable mini-batch shapes with different spatial-temporal resolutions that are
varied according to a schedule. The different shapes arise from resampling the
training data on multiple sampling grids. Training is accelerated by scaling up
the mini-batch size and learning rate when shrinking the other dimensions. We
empirically demonstrate a general and robust grid schedule that yields a
significant out-of-the-box training speedup without a loss in accuracy for
different models (I3D, non-local, SlowFast), datasets (Kinetics,
Something-Something, Charades), and training settings (with and without
pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed
multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock
time, same hardware) while also improving accuracy (+0.8% absolute) on
Kinetics-400 compared to the baseline training method. Code is available
online.
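
The trade-off between mini-batch size and the other dimensions can be sketched with a short example. It assumes that the per-iteration cost is proportional to clips x frames x height x width, and that the learning rate follows a linear scaling rule with the batch size. The base shape, the sampling grids, and the function names are hypothetical and are not the schedule from the paper.

```python
def scaled_batch(base_shape, new_shape):
    """Batch size that keeps clips * frames * height * width roughly
    constant when the clip shape changes (assumed cost model)."""
    b, t, h, w = base_shape
    t2, h2, w2 = new_shape
    base_cost = b * t * h * w
    return max(1, base_cost // (t2 * h2 * w2))


def scaled_lr(base_lr, base_batch, new_batch):
    """Linear learning-rate scaling with batch size (an assumption here)."""
    return base_lr * new_batch / base_batch


# Hypothetical baseline: 8 clips of 16 frames at 224 x 224, learning rate 0.1.
base_shape = (8, 16, 224, 224)
base_lr = 0.1

# Hypothetical coarse-to-fine grids as (frames, height, width).
grids = [(4, 112, 112), (8, 160, 160), (16, 224, 224)]

for grid in grids:
    batch = scaled_batch(base_shape, grid)
    lr = scaled_lr(base_lr, base_shape[0], batch)
    print(grid, batch, round(lr, 4))
```

With these illustrative numbers, the coarsest grid allows a batch of 128 clips for the same per-iteration cost as 8 clips at full resolution, so one pass over the data takes far fewer iterations.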

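We propose a general and robust grid schedule, based on multigrid methods and variable mini-batch shapes, that speeds up video model training without a loss in accuracy and applies across different models, datasets, and training settings.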