Recent works on Video Object Segmentation have achieved remarkable results by
matching dense semantic and instance-level features between the current and
previous frames for long-term propagation. Nevertheless, global feature
matching ignores scene motion context and thus fails to maintain temporal
consistency. Even though some methods introduce a local matching branch to
achieve smooth propagation, they fail to model complex appearance changes due
to the constraints of the local window. In this paper, we present DeVOS
(Deformable VOS), an architecture for Video Object Segmentation that combines
memory-based matching with motion-guided propagation, resulting in stable
long-term modeling and strong temporal consistency. For short-term local
propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable
Video Attention), which adapts the similarity search region to query-specific
semantic features and thereby ensures robust tracking of complex shape and
scale changes. DeVOS employs optical flow to obtain scene motion features,
which are then injected into the deformable attention as strong priors for the
learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val
and test-dev (88.1%, 83.0%) and on YouTube-VOS 2019 val (86.6%), while
featuring consistent run-time speed and stable memory consumption.
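
The flow-guided sampling described above can be illustrated with a minimal
sketch. It assumes one plausible reading of the abstract: each sampling
location in the previous frame is the query position, plus an optical-flow
displacement acting as the prior, plus a learned residual offset. The function
names, the bilinear sampler, and the fixed offsets are hypothetical and are not
taken from the DeVOS implementation; in a trained model the offsets would
presumably be predicted from the query features.

```python
import math


def bilinear(feat, y, x):
    """Sample feat[H][W][C] at a fractional (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return [
        (1 - wy) * ((1 - wx) * feat[y0][x0][c] + wx * feat[y0][x1][c])
        + wy * ((1 - wx) * feat[y1][x0][c] + wx * feat[y1][x1][c])
        for c in range(len(feat[0][0]))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flow_guided_deformable_attention(query, prev_keys, prev_values,
                                     pos, flow, offsets):
    """Attend from one query pixel to a few sampled points in the previous frame.

    query:       feature vector of the current-frame pixel, list of C floats
    prev_keys:   previous-frame key map, nested lists [H][W][C]
    prev_values: previous-frame value map, nested lists [H][W][C]
    pos:         (y, x) position of the query pixel
    flow:        (dy, dx) displacement toward the previous frame
                 (assumed to come from an external optical flow estimator)
    offsets:     list of (dy, dx) residual offsets (fixed here; an assumption)
    """
    y, x = pos
    fy, fx = flow
    scale = 1.0 / math.sqrt(len(query))
    scores, samples = [], []
    for oy, ox in offsets:
        # Flow provides the prior location; the offset refines it.
        sy, sx = y + fy + oy, x + fx + ox
        key = bilinear(prev_keys, sy, sx)
        scores.append(scale * sum(q * k for q, k in zip(query, key)))
        samples.append(bilinear(prev_values, sy, sx))
    weights = softmax(scores)
    channels = len(samples[0])
    return [sum(w * v[c] for w, v in zip(weights, samples))
            for c in range(channels)]
```

Because the number of sampled points is fixed per query, the cost of this
local step does not grow with the frame size, which is in line with the stable
run-time and memory behavior the abstract reports.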

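DeVOS (Deformable VOS) is an architecture for video object segmentation that combines memory-based matching with motion-guided propagation, achieving stable long-term modeling and strong temporal consistency.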

DeVOS: Flow-Guided Deformable Transformer for Video Object Segmentation

Addressing the dual challenges of local redundancy and global dependencies in
video understanding, this work innovatively adapts Mamba to the video
domain. The proposed VideoMamba overcomes the limitations of existing 3D
convolutional neural networks and video transformers. Its linear-complexity
operator enables efficient long-term modeling, which is crucial for
high-resolution long video understanding. Extensive evaluations reveal
VideoMamba's four core abilities: (1) Scalability in the visual domain without
extensive dataset pretraining, thanks to a novel self-distillation technique;
(2) Sensitivity for recognizing short-term actions even with fine-grained
motion differences; (3) Superiority in long-term video understanding,
showcasing significant advancements over traditional feature-based models; and
(4) Compatibility with other modalities, demonstrating robustness in
multi-modal contexts. With these advantages, VideoMamba sets a new benchmark
for video understanding and offers a scalable and efficient solution for the
task. All code and models are available at this https URL
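
To make the linear-complexity claim concrete, the following is a minimal
sketch of a state-space scan over flattened video tokens. It uses a fixed
diagonal recurrence with one scalar feature per token as a simplifying
assumption; Mamba additionally makes its parameters input-dependent, which is
omitted here. The helper names and the frame-major token order are
illustrative and are not taken from the VideoMamba code.

```python
def flatten_video(frames):
    """Flatten [T][N] patch features into one token sequence.

    Frame-major order is an assumption made for this sketch.
    """
    return [patch for frame in frames for patch in frame]


def ssm_scan(tokens, a, b, c):
    """Run a diagonal linear state-space recurrence over a token sequence.

    For each token x_t and each state dimension i:
        h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
        y_t    = sum_i c[i] * h_t[i]
    The loop visits every token once, so the cost grows linearly with the
    sequence length, unlike the quadratic cost of full self-attention.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in tokens:
        state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
        outputs.append(sum(ci * hi for ci, hi in zip(c, state)))
    return outputs


# Usage: two frames with two patches each, one state dimension.
frames = [[1.0, 0.0], [0.0, 0.0]]
outputs = ssm_scan(flatten_video(frames), a=[0.5], b=[1.0], c=[1.0])
```

In this usage example the first token's contribution decays by the factor
`a` at each later step, so information from the first frame still reaches the
second frame through the carried state without any pairwise token comparison.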

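Proposes VideoMamba, a Mamba-based approach to video understanding that overcomes the limitations of existing 3D convolutional neural networks and video transformers. Its linear-complexity operator enables efficient modeling of long videos, and the model demonstrates scalability in the visual domain, sensitivity in short-term action recognition, superiority in long-term video understanding, and compatibility with other modalities.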

VideoMamba: State Space Model for Efficient Video Understanding

While today's video recognition systems parse snapshots or short clips
accurately, they cannot yet connect the dots and reason across a longer range
of time. Most existing video architectures can only process <5 seconds of a
video without hitting computation or memory bottlenecks.
In this paper, we propose a new strategy to overcome this challenge. Instead
of trying to process more frames at once like most existing methods, we propose
to process videos in an online fashion and cache "memory" at each iteration.
Through the memory, the model can reference prior context for long-term
modeling, with only a marginal cost. Based on this idea, we build MeMViT, a
Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x
longer than existing models with only 4.5% more compute; traditional methods
need >3,000% more compute to do the same. Across a wide range of settings, the
increased temporal support enabled by MeMViT consistently brings large gains in
recognition accuracy. MeMViT obtains state-of-the-art results on the AVA,
EPIC-Kitchens-100 action classification, and action anticipation datasets. Code
and models are available at this https URL
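
The online, memory-cached processing described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not the
MeMViT implementation: tokens serve directly as keys and values (no learned
projections), the memory is a fixed-length queue of past clips, and the names
`attend` and `process_video_online` are hypothetical.

```python
import math
from collections import deque


def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append([
            sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))
        ])
    return outputs


def process_video_online(clips, memory_len):
    """Process clips one at a time, attending to cached earlier clips.

    clips:      list of clips, each a list of token vectors
    memory_len: how many past clips stay in the cache (an assumption;
                0 disables the memory)
    Returns the per-clip outputs and the number of keys each clip attended to.
    """
    memory = deque(maxlen=memory_len)
    outputs, context_sizes = [], []
    for clip in clips:
        # Context = cached tokens from earlier iterations + current clip.
        context = [tok for past in memory for tok in past] + clip
        outputs.append(attend(clip, context, context))
        context_sizes.append(len(context))
        memory.append(clip)  # cache this clip for later iterations
    return outputs, context_sizes


# Usage: four single-token clips with a memory of two past clips.
clips = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]], [[0.5, 0.5]]]
_, sizes = process_video_online(clips, memory_len=2)
```

In the usage example `sizes` grows until the cache is full and then stays
constant: each clip sees a longer temporal context than its own frames, while
the per-iteration cost remains bounded by the cache length.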

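This paper proposes a new strategy that processes videos online and caches "memory" at each iteration. Building on it, MeMViT, a Memory-augmented Multiscale Vision Transformer, attains a temporal support 30x longer than existing models with only 4.5% more compute, whereas traditional methods need more than 3,000% more compute to do the same. It achieves state-of-the-art results on AVA and on EPIC-Kitchens-100 action classification and action anticipation.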