Recently, with the introduction of large-scale datasets and strong transformer
networks, video-language pre-training has shown great success, especially for
retrieval. Yet existing video-language transformer models do not explicitly
perform fine-grained semantic alignment. In this work, we present Object-aware
Transformers, an object-centric approach that extends the video-language
transformer to incorporate object representations. The key idea is to leverage
the bounding boxes and
object tags to guide the training process. We evaluate our model on three
standard sub-tasks of video-text matching on four widely used benchmarks. We
also provide in-depth analysis and detailed ablations of the proposed method. We
show clear improvement in performance across all tasks and datasets considered,
demonstrating the value of a model that incorporates object representations
into a video-language architecture. The code will be released at
https://github.com/FingerRec/OA-Transformer.

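This paper proposes Object-aware Transformers, an object-aware transformer model that uses bounding boxes and object tags to guide the training process, bringing object representations into the video-language architecture and thereby improving performance on video-text matching tasks.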

Object-aware Video-language Pre-training for Retrieval

Cross-modal retrieval between videos and texts has attracted growing
attention due to the rapid emergence of videos on the web. The current
dominant approach for this problem is to learn a joint embedding space to
measure cross-modal similarities. However, simple joint embeddings are
insufficient to represent complicated visual and textual details, such as
scenes, objects, actions and their compositions. To improve fine-grained
video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model,
which decomposes video-text matching into global-to-local levels. Specifically,
the model disentangles texts into a hierarchical semantic graph with three
levels (events, actions, and entities) and relationships across
levels. Attention-based graph reasoning is utilized to generate hierarchical
textual embeddings, which can guide the learning of diverse and hierarchical
video representations. The HGR model aggregates matchings from different
video-text levels to capture both global and local details. Experimental
results on three video-text datasets demonstrate the advantages of our model.
Such hierarchical decomposition also enables better generalization across
datasets and improves the ability to distinguish fine-grained semantic
differences.
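
The idea of scoring a video-text pair at several levels and aggregating the results can be illustrated with a minimal sketch. It is not the HGR implementation: it assumes one embedding vector per level (the model itself reasons over graph nodes with attention), uses cosine similarity in the joint space, and aggregates by a plain average. The function names and the toy vectors are hypothetical.

```python
import math

LEVELS = ("event", "action", "entity")  # the three levels named in the abstract


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aggregate_similarity(video_levels, text_levels):
    """Average the per-level cosine similarities (averaging is an assumption)."""
    scores = [cosine(video_levels[k], text_levels[k]) for k in LEVELS]
    return sum(scores) / len(scores)


def rank_videos(text_levels, videos):
    """Return video ids sorted by descending aggregated similarity to the text."""
    return sorted(
        videos,
        key=lambda vid: aggregate_similarity(videos[vid], text_levels),
        reverse=True,
    )


# Toy embeddings, invented purely for illustration.
query = {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]}
videos = {
    "v1": {"event": [1.0, 0.0], "action": [0.0, 1.0], "entity": [1.0, 1.0]},
    "v2": {"event": [0.0, 1.0], "action": [0.0, 1.0], "entity": [1.0, 0.0]},
}
ranking = rank_videos(query, videos)
```

In this toy setup, "v2" matches the query at the action level but not at the event level, so its aggregated score falls below that of "v1", which matches at every level.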

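The paper proposes a Hierarchical Graph Reasoning (HGR) model that decomposes video-text matching into global-to-local semantic levels. Attention-based graph reasoning generates hierarchical textual embeddings, which guide the learning of diverse, hierarchical video representations, and matchings from different video-text levels are aggregated to capture both global and local details, enabling cross-modal retrieval between videos and texts.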