Spatiotemporal action localization in chaotic scenes is a challenging step
toward advanced video understanding. Both high-quality video feature
extraction and more precise detector-predicted anchors can effectively
improve model performance. To this
end, we propose a high-performance dual-stream spatiotemporal feature
extraction network SFMViT with an anchor pruning strategy. The backbone of our
SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal
action localization, which fully utilizes ViT's excellent global feature
extraction capabilities and SlowFast's spatiotemporal sequence modeling
capabilities. In addition, we introduce a confidence max-heap to prune the
anchors detected in each frame and keep only the effective ones. These
designs enable SFMViT to achieve an mAP of 26.62% on the Chaotic World
dataset, far exceeding existing models. Code is available at
this https URL
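
The anchor pruning step can be illustrated with a minimal sketch. The abstract only states that a confidence max-heap is used to prune the anchors detected in each frame; the anchor layout `(confidence, box)`, the function name `prune_anchors`, and the `keep` budget below are assumptions made for illustration and are not taken from the SFMViT code.

```python
import heapq
from typing import List, Tuple

# An anchor is modelled as (confidence, (x1, y1, x2, y2)).
# This layout is an assumption for the sketch, not the paper's data structure.
Box = Tuple[float, float, float, float]
Anchor = Tuple[float, Box]


def prune_anchors(anchors: List[Anchor], keep: int) -> List[Anchor]:
    """Return up to `keep` anchors of one frame, highest confidence first."""
    # heapq implements a min-heap, so confidences are negated to get
    # max-heap behaviour; the index breaks ties and avoids comparing boxes.
    heap = [(-conf, idx) for idx, (conf, _box) in enumerate(anchors)]
    heapq.heapify(heap)

    kept: List[Anchor] = []
    while heap and len(kept) < keep:
        _neg_conf, idx = heapq.heappop(heap)
        kept.append(anchors[idx])
    return kept


# Hypothetical detections for a single frame.
frame_anchors = [
    (0.20, (0.0, 0.0, 10.0, 10.0)),
    (0.90, (5.0, 5.0, 20.0, 20.0)),
    (0.55, (8.0, 2.0, 16.0, 12.0)),
]
top_two = prune_anchors(frame_anchors, keep=2)
```

Whether the actual method keeps a fixed number of anchors per frame or applies a confidence threshold is not stated in the abstract; a fixed budget is used here only because it keeps the heap logic easy to follow.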

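By combining SFMViT, a dual-stream spatiotemporal feature extraction network that provides high-quality video features, with an anchor pruning strategy, we effectively improve model performance and achieve a mean average precision (mAP) of 26.62% in chaotic scenes.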

SFMViT: SlowFast Meet ViT in Chaotic World

Frame quality deterioration is one of the main challenges in the field of
video understanding. To compensate for the information loss caused by
deteriorated frames, recent approaches exploit transformer-based integration
modules to obtain spatio-temporal information. However, these integration
modules are heavy and complex. Furthermore, each integration module is
specifically tailored for its target task, making it difficult to generalise to
multiple tasks. In this paper, we present a neat and unified framework, called
Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and
accurate video features by dynamically adjusting the input features in the
backbone network. Specifically, STPN predicts several video prompts containing
spatio-temporal information from neighbouring frames. Then, these video prompts are
prepended to the patch embeddings of the current frame as the updated input for
video feature extraction. Moreover, STPN is easy to generalise to various video
tasks because it does not contain task-specific modules. Without bells and
whistles, STPN achieves state-of-the-art performance on three widely-used
datasets for different video understanding tasks, i.e., ImageNetVID for video
object detection, YouTubeVIS for video instance segmentation, and GOT-10k for
visual object tracking. Code is available at
this https URL
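
The prompt-prepending step described above can be sketched as follows. This is only an illustration of placing predicted prompt tokens ahead of the current frame's patch embeddings; how STPN predicts the prompts is not covered, and the function name `prepend_prompts`, the token representation as plain lists, and the example sizes are all assumptions.

```python
from typing import List

# A token is one embedding vector; plain lists keep the sketch dependency-free.
Token = List[float]


def prepend_prompts(prompts: List[Token], patches: List[Token]) -> List[Token]:
    """Place video prompt tokens in front of the current frame's patch tokens."""
    # All tokens must share one embedding dimension to form a single sequence.
    dims = {len(token) for token in prompts + patches}
    if len(dims) > 1:
        raise ValueError(f"mismatched embedding dimensions: {sorted(dims)}")
    # List concatenation keeps prompts first, then patches in original order.
    return prompts + patches


# Hypothetical sizes: 2 prompt tokens and 3 patch tokens, each of dimension 4.
video_prompts = [[1.0] * 4 for _ in range(2)]
patch_embeddings = [[0.0] * 4 for _ in range(3)]
backbone_input = prepend_prompts(video_prompts, patch_embeddings)
```

The resulting sequence has `len(video_prompts) + len(patch_embeddings)` tokens, which is what a transformer backbone would then consume as its updated input under these assumptions.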

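Frame quality deterioration is one of the main challenges in video understanding. To compensate for the information lost to deteriorated frames, recent approaches use Transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are overly complex and heavy. In this paper, we present a neat and unified framework called the Spatio-Temporal Prompting Network (STPN). By dynamically adjusting the input features in the backbone network, it efficiently extracts robust and accurate video features. Moreover, STPN generalises easily to various video tasks because it contains no task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely used datasets covering different video understanding tasks: ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking.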

Spatio-temporal Prompting Network for Robust Video Feature Extraction

Untrimmed videos have interrelated events, dependencies, context, overlapping
events, object-object interactions, domain specificity, and other semantics
that are worth highlighting while describing a video in natural language. Owing
to such a vast diversity, a single sentence can only correctly describe a
portion of the video. Dense Video Captioning (DVC) aims at detecting and
describing different events in a given video. The term DVC originated in the
2017 ActivityNet challenge, after which considerable effort has been made to
address the challenge. Dense Video Captioning is divided into three sub-tasks:
(1) Video Feature Extraction (VFE), (2) Temporal Event Localization (TEL), and
(3) Dense Caption Generation (DCG). This review aims to discuss all the studies
that claim to perform DVC along with its sub-tasks and summarize their results.
We also discuss all the datasets that have been used for DVC. Lastly, we
highlight some emerging challenges and future trends in the field.

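This review surveys Dense Video Captioning (DVC). It covers the semantics worth highlighting when describing long videos, such as interrelated events, dependencies, context, overlapping events, object-object interactions, and domain specificity. It also discusses the DVC sub-tasks (video feature extraction, temporal event localization, and dense caption generation) and their results, and examines the datasets used for DVC along with emerging challenges and future trends in the field.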