Classifying videos into distinct categories, such as Sport and Music Video,
is crucial for multimedia understanding and retrieval, especially in an age
where an immense volume of video content is constantly being generated.
Traditional methods require video decompression to extract pixel-level features
like color, texture, and motion, thereby increasing computational and storage
demands. Moreover, these methods often suffer from performance degradation in
low-quality videos. We present a novel approach that examines only the
post-compression bitstream of a video to perform classification, eliminating
the need for decompression. We validate our approach using a custom-built data set
comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning
11 distinct categories. Our preliminary evaluations indicate precision,
accuracy, and recall rates well over 80%. The algorithm operates approximately
15,000 times faster than real time for 30 fps videos, outperforming the traditional
Dynamic Time Warping (DTW) algorithm by six orders of magnitude.
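
To put the quoted speed figure in concrete terms, the short sketch below works through the arithmetic. It uses only the approximate numbers stated in the abstract (30 fps, a 15,000x speedup over real time, over 29,000 clips, 6,000 hours); the function names are hypothetical helpers introduced here for illustration and are not part of the described method.

```python
# Figures quoted in the abstract; all of them are approximate.
FPS = 30              # frame rate the speed claim refers to
SPEEDUP = 15_000      # "approximately 15,000 times faster than real-time"
NUM_CLIPS = 29_000    # "over 29,000" clips, so this is a lower bound
TOTAL_HOURS = 6_000   # total duration of the data set


def frames_per_second(fps: int = FPS, speedup: int = SPEEDUP) -> int:
    """Frames covered per wall-clock second at the stated speedup."""
    return fps * speedup


def processing_seconds(video_seconds: float, speedup: int = SPEEDUP) -> float:
    """Wall-clock seconds needed for a video of the given duration."""
    return video_seconds / speedup


def mean_clip_minutes(total_hours: float = TOTAL_HOURS,
                      num_clips: int = NUM_CLIPS) -> float:
    """Average clip length; an upper bound, since NUM_CLIPS is a lower bound."""
    return total_hours * 60 / num_clips


if __name__ == "__main__":
    print("frames per wall-clock second:", frames_per_second())
    print("seconds to process one hour of video:", processing_seconds(3600))
    print("seconds to process the whole data set:",
          processing_seconds(TOTAL_HOURS * 3600))
    print("mean clip length in minutes (at most):", round(mean_clip_minutes(), 1))
```

Under these figures, one hour of 30 fps video corresponds to about a quarter of a second of processing, and the full 6,000-hour data set to roughly 24 minutes.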

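For video classification, the approach examines a video's compressed bitstream instead of the decompression-based feature extraction used by traditional methods, with the aim of improving classification performance and processing speed. Validated on a custom-built data set, it achieves precision, accuracy, and recall all above 80%, runs about 15,000 times faster than real time, and outperforms the traditional Dynamic Time Warping algorithm by six orders of magnitude.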

Judging a video by its bitstream cover

Localizing objects in 3D scenes according to the semantics of a given natural
language description is a fundamental and important task in the field of
multimedia understanding, benefiting real-world applications such as robotics
and autonomous driving. However, the majority of existing 3D object grounding
methods are restricted to a single-sentence input describing an individual
object, and cannot comprehend or reason over more contextualized descriptions
of multiple objects in more practical 3D cases. To this end, we introduce a new
challenging task, called 3D Dense Object Grounding (3D DOG), to jointly
localize multiple objects described in a more complicated paragraph rather than
a single sentence. Instead of naively localizing each sentence-guided object
independently, we observe that dense objects described in the same paragraph are
often semantically related and spatially located in a focused region of the 3D
scene. To explore such semantic and spatial relationships of densely referred
objects for more accurate localization, we propose a novel Stacked Transformer
based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a
contextual query-driven local transformer decoder to generate initial grounding
proposals for each target object. Then, we employ a proposal-guided global
transformer decoder that exploits the local object features to learn their
correlations and further refine the initial grounding proposals. Extensive
experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show
that our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object
grounding methods and their dense-object variants by significant margins.
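
The local-then-global refinement described above can be illustrated with a deliberately simplified sketch. This is not the 3DOGSFormer architecture: it uses plain single-head scaled dot-product attention with no learned weights, omits box prediction entirely, and every name in it (`attend`, `two_stage_features`, the toy inputs) is a hypothetical stand-in. It only shows the data flow, in which each sentence query first gathers scene evidence on its own, and the resulting per-object features then attend to one another.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    width = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(width)])
    return out


def two_stage_features(sentence_queries: Matrix,
                       scene_features: Matrix) -> Tuple[Matrix, Matrix]:
    # Stage 1 (local, assumed form): each query attends to the scene alone.
    local = attend(sentence_queries, scene_features, scene_features)
    # Stage 2 (global, assumed form): per-object features attend to each
    # other, and the result is added back as a residual refinement.
    context = attend(local, local, local)
    refined = [[a + b for a, b in zip(l, c)] for l, c in zip(local, context)]
    return local, refined


if __name__ == "__main__":
    queries = [[0.5, 0.5], [1.0, 0.0]]               # two described objects
    scene = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]     # three toy scene features
    local, refined = two_stage_features(queries, scene)
    print(local)
    print(refined)
```

In a real model, the refined features would feed a box regression head, and both stages would use learned projections and multiple layers; those parts are left out here because the abstract does not specify them.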

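Localizing objects in 3D scenes from language semantics is a fundamental and important task in multimedia understanding. This work proposes a new task, 3D Dense Object Grounding (3D DOG), which jointly localizes multiple objects described in a more complicated paragraph rather than a single sentence, together with a new Stacked Transformer based framework, 3DOGSFormer. A contextual query-driven local transformer decoder generates initial grounding proposals, and a proposal-guided global transformer decoder further refines them. Experiments show that the method outperforms existing 3D single-object grounding methods and their dense-object variants on several challenging benchmarks.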

Dense Object Grounding in 3D Scenes

Action recognition is an important problem in multimedia understanding. This
paper addresses this problem by building an expressive compositional action
model. We model one action instance in the video with an ensemble of
spatio-temporal compositions: a number of discrete temporal anchor frames, each
of which is further decomposed into a layout of deformable parts. In this way,
our model can identify a Spatio-Temporal And-Or Graph (STAOG) to represent the
latent structure of actions, e.g., triple jumping, swinging, and high jumping.
The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom
for detecting various action parts within video patches; (ii) the or-nodes
above them, i.e., switch variables that activate their child leaf-nodes for
structural variability; (iii) the and-nodes within an anchor frame for
verifying spatial composition; and (iv) the root-node at the top for
aggregating scores over temporal anchor frames. Moreover, contextual interactions are
defined between leaf-nodes in both spatial and temporal domains. For model
training, we develop a novel weakly supervised learning algorithm which
iteratively determines the structural configuration (e.g. the production of
leaf-nodes associated with the or-nodes) along with the optimization of
multi-layer parameters. By fully exploiting spatio-temporal compositions and
interactions, our approach handles large intra-class action variance well (e.g.,
different views, individual appearances, and spatio-temporal structures).
Experimental results on challenging databases demonstrate the superior
performance of our approach over competing methods.
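
The four-layer structure described above suggests a bottom-up scoring pass, sketched below under stated assumptions. The abstract says only that or-nodes act as switch variables, and-nodes verify spatial composition, and the root-node aggregates scores over anchor frames. Taking a maximum at each or-node and a plain sum at the and-nodes and root is a common And-Or graph convention assumed here, not something the abstract confirms. The contextual interaction terms between leaf-nodes are omitted, and the leaf scores are made-up numbers; all type and function names are hypothetical.

```python
from typing import List, Tuple

LeafScores = List[float]        # candidate leaf-node scores under one or-node
AnchorFrame = List[LeafScores]  # one and-node: the or-nodes (parts) of a frame
Action = List[AnchorFrame]      # root-node: the anchor frames of one action


def or_node(leaf_scores: LeafScores) -> Tuple[float, int]:
    """Switch variable: activate the best-scoring child leaf (assumed max)."""
    best = max(range(len(leaf_scores)), key=leaf_scores.__getitem__)
    return leaf_scores[best], best


def and_node(frame: AnchorFrame) -> Tuple[float, List[int]]:
    """Combine all parts of one anchor frame (assumed plain sum)."""
    picks = [or_node(part) for part in frame]
    return sum(score for score, _ in picks), [index for _, index in picks]


def root_node(action: Action) -> Tuple[float, List[List[int]]]:
    """Aggregate frame scores over the temporal anchor frames (assumed sum)."""
    frames = [and_node(frame) for frame in action]
    return sum(score for score, _ in frames), [cfg for _, cfg in frames]


if __name__ == "__main__":
    # Two anchor frames with two parts each; the numbers are illustrative only.
    toy_action = [
        [[0.25, 1.0], [0.5]],
        [[0.125, 0.25, 0.5], [0.75, 0.5]],
    ]
    score, configuration = root_node(toy_action)
    print(score)          # 2.75
    print(configuration)  # [[1, 0], [2, 0]]
```

The returned configuration records which leaf each or-node activated. This is the kind of structural choice that the weakly supervised training described above would re-estimate on each iteration, alongside the parameters.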

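By building an expressive compositional action model that captures the spatio-temporal compositions of action instances in video, and training it with a weakly supervised learning algorithm, the method identifies the latent structure of actions. Experimental results show that it outperforms competing methods on action recognition.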