This paper focuses on building object-centric representations for long-term
action anticipation in videos. Our key motivation is that objects provide
important cues to recognize and predict human-object interactions, especially
when the predictions are longer term, as an observed "background" object could
be used by the human actor in the future. We observe that existing object-based
video recognition frameworks either assume the existence of in-domain
supervised object detectors or follow a fully weakly-supervised pipeline to
infer object locations from action labels. We propose to build object-centric
video representations by leveraging visual-language pretrained models. This is
achieved by "object prompts", an approach to extract task-specific
object-centric representations from general-purpose pretrained models without
finetuning. To recognize and predict human-object interactions, we use a
Transformer-based neural architecture which allows the "retrieval" of relevant
objects for action anticipation at various time scales. We conduct extensive
evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both
quantitative and qualitative results confirm the effectiveness of our proposed
method.

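This paper aims to build object-centric representations for long-term action anticipation in videos. We propose to build object-centric video representations by leveraging visual-language pretrained models, using "object prompts" to extract task-specific object-centric representations from general-purpose pretrained models. We use a Transformer-based neural architecture to recognize and predict human-object interactions, and conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks; quantitative and qualitative results confirm the effectiveness of the proposed method.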

Object-centric Video Representation for Long-term Action Anticipation

Pretrained models have produced great success in both Computer Vision (CV)
and Natural Language Processing (NLP). This progress has led to Visual-Language
Pretrained Models (VLPMs), which learn joint representations of vision and
language by feeding visual and linguistic content into a multi-layer
transformer. In this paper, we present an overview of the major advances
achieved in VLPMs for producing joint representations of vision and language.
As preliminaries, we briefly describe the general task definition and
generic architecture of VLPMs. We first discuss the language and vision data
encoding methods and then present the mainstream VLPM structure as the core
content. We further summarise several essential pretraining and fine-tuning
strategies. Finally, we highlight three future directions to guide both CV and
NLP researchers.

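This paper reviews the great success of pretrained models in computer vision and natural language processing, focusing on the major advances in Visual-Language Pretrained Models (VLPMs) and their architectures, pretraining strategies, and fine-tuning strategies, and proposes three directions for future research.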