Self-supervised representation learning aims to learn convnet-based image
representations from unlabeled data. Inspired by the success of NLP methods in
this area, in this work we propose a self-supervised approach based on
spatially dense image descriptions that encode discrete visual concepts, here
called visual words. To build such discrete representations, we quantize the
feature maps of a first pre-trained self-supervised convnet, over a k-means
based vocabulary. Then, as a self-supervised task, we train another convnet to
predict the histogram of visual words of an image (i.e., its Bag-of-Words
representation) given as input a perturbed version of that image. The proposed
task forces the convnet to learn perturbation-invariant and context-aware image
features, useful for downstream image understanding tasks. We extensively
evaluate our method and demonstrate very strong empirical results, e.g., our
pre-trained self-supervised representations transfer better on detection tasks
and similarly on classification over classes "unseen" during pre-training, when
compared to the supervised case.
This also shows that the process of image discretization into visual words
can provide the basis for very powerful self-supervised approaches in the image
domain, thus allowing further connections to be made to related methods from
the NLP domain that have been extremely successful so far.
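
The quantization step and the Bag-of-Words target described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (kmeans, bow_target, soft_cross_entropy), the Euclidean distance, the L1-normalized histogram, and the cross-entropy loss against a soft target are assumptions made for the example, and the feature vectors are toy values rather than convnet outputs.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; the k centroids play the role of the vocabulary."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[idx].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


def bow_target(feature_map, vocabulary):
    """Assign each spatial feature to its nearest visual word and return
    the normalized histogram of word counts (the Bag-of-Words target)."""
    counts = [0] * len(vocabulary)
    for row in feature_map:
        for feat in row:
            word = min(
                range(len(vocabulary)),
                key=lambda j: squared_distance(feat, vocabulary[j]),
            )
            counts[word] += 1
    total = sum(counts)
    return [c / total for c in counts]


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution.
    Assumed loss for the second convnet; the abstract does not specify it."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for l, t in zip(logits, target))


# Toy 2x2 feature map with 2-d features and a hand-written 2-word vocabulary.
vocabulary = [[0.0, 0.0], [1.0, 1.0]]
feature_map = [[[0.1, 0.0], [0.9, 1.0]], [[0.0, 0.2], [0.1, 0.1]]]
target = bow_target(feature_map, vocabulary)
print(target)  # [0.75, 0.25]
```

In the setting described above, the features would come from the first pre-trained convnet applied to the original image, while the logits passed to the loss would come from the second convnet applied to a perturbed version of the same image.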

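This paper proposes a self-supervised learning method based on visual words. Image feature maps are quantized into visual words to obtain a discrete representation of each image, and a convnet learns features useful for downstream image understanding by predicting the resulting Bag-of-Words representation. Compared with supervised pre-training, the learned representations transfer better on object detection and similarly on classification of classes unseen during pre-training.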

Learning Representations by Predicting Bags of Visual Words

It is encouraging to see that progress has been made to bridge videos and
natural language. However, mainstream video captioning methods suffer from slow
inference speed due to the sequential manner of autoregressive decoding, and
tend to generate generic descriptions due to insufficient training of
visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this
paper, we propose a non-autoregressive decoding based model with a
coarse-to-fine captioning procedure to alleviate these defects. In our
implementation, we employ a bi-directional self-attention based network as our
language model to speed up inference. On top of it, we decompose the
captioning procedure into two stages with different focuses.
Specifically, given that visual words determine the semantic correctness of
captions, we design a mechanism of generating visual words to not only promote
the training of scene-related words but also capture relevant details from
videos to construct a coarse-grained sentence "template". Thereafter, we devise
dedicated decoding algorithms that fill in the "template" with suitable words
and modify inappropriate phrasing via iterative refinement to obtain a
fine-grained description. Extensive experiments on two mainstream video
captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach
achieves state-of-the-art performance, generates diverse descriptions, and
obtains high inference efficiency. Our code is available at
this https URL
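
The fill-in-and-refine stage described above can be illustrated with a toy loop. This is only a sketch under stated assumptions: the linear re-masking schedule is borrowed from mask-predict style decoding and is not confirmed to be the paper's algorithm, refine and toy_predict are hypothetical names, template words are kept fixed, and toy_predict is a stand-in for the bi-directional language model that returns a (word, confidence) pair for every position.

```python
MASK = "[MASK]"


def refine(template, predict, num_iters=3):
    """Fill the masked slots of a coarse template in parallel, then repeatedly
    re-mask the least confident filled slots and predict them again."""
    tokens = list(template)
    # Only slots that were masked in the template may change (assumption).
    editable = [i for i, tok in enumerate(tokens) if tok == MASK]
    confidence = {i: 0.0 for i in editable}
    for step in range(num_iters):
        proposals = predict(tokens)  # one (word, confidence) pair per position
        for i in editable:
            if tokens[i] == MASK:
                tokens[i], confidence[i] = proposals[i]
        # The number of slots to re-mask decays linearly to zero.
        n_mask = len(editable) * (num_iters - 1 - step) // num_iters
        if n_mask == 0:
            break
        worst = sorted(editable, key=lambda i: confidence[i])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


REFERENCE = ["a", "man", "is", "playing", "a", "guitar"]


def toy_predict(tokens):
    """Hypothetical predictor: proposes the reference word at each position,
    with higher confidence when more neighbouring words are already known."""
    out = []
    for i in range(len(tokens)):
        known = sum(
            1
            for j in (i - 1, i + 1)
            if 0 <= j < len(tokens) and tokens[j] != MASK
        )
        out.append((REFERENCE[i], 0.5 + 0.25 * known))
    return out


# Coarse template: visual words (nouns and verbs) are given, the rest is masked.
template = [MASK, "man", MASK, "playing", MASK, "guitar"]
caption = refine(template, toy_predict, num_iters=3)
print(" ".join(caption))  # a man is playing a guitar
```

Because all masked positions are predicted at once in each pass, the number of model calls is bounded by num_iters rather than by the caption length, which is where the speedup over autoregressive decoding comes from.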

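This paper proposes a non-autoregressive decoding model that uses a bi-directional self-attention based language model to speed up inference. Caption generation is split into two stages, and iterative refinement yields a high-quality, fine-grained video description. Extensive experiments show that the method achieves state-of-the-art performance and high inference efficiency.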

Non-Autoregressive Coarse-to-Fine Video Captioning

Personal robots and driverless cars need to be able to operate in novel
environments and thus quickly and efficiently learn to recognise new object
classes. We address this problem by considering the task of video object
segmentation. Previous accurate methods for this task finetune a model using
the first annotated frame, and/or use additional inputs such as optical flow
and complex post-processing. In contrast, we develop a fast, causal algorithm
that requires no finetuning, auxiliary inputs or post-processing, and segments
a variable number of objects in a single forward-pass. We represent an object
with clusters, or "visual words", in the embedding space, which correspond to
object parts in the image space. This allows us to robustly match to the
reference objects throughout the video, because although the global appearance
of an object changes as it undergoes occlusions and deformations, the
appearance of more local parts may stay consistent. We learn these visual words
in an unsupervised manner, using meta-learning to ensure that our training
objective matches our inference procedure. We achieve comparable accuracy to
finetuning based methods (whilst being 1 to 2 orders of magnitude faster), and
state-of-the-art in terms of speed/accuracy trade-offs on four video
segmentation datasets. Code is available at
this https URL
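
The matching step described above, where each pixel is assigned to the object that owns its nearest visual word, can be sketched as follows. This is a simplified illustration rather than the authors' method: the function name segment, the use of squared Euclidean distance, and the hard nearest-word assignment are assumptions, the visual words are hand-written instead of learned, and the meta-learning objective is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def segment(query_embeddings, object_words):
    """Label each query pixel embedding with the object whose visual word
    (cluster centre in the embedding space) lies closest to it.

    object_words maps an object id to a list of visual words, so an object
    with several distinct parts can own several words.
    """
    labels = []
    for emb in query_embeddings:
        best_obj, best_dist = None, float("inf")
        for obj_id, words in object_words.items():
            for word in words:
                dist = squared_distance(emb, word)
                if dist < best_dist:
                    best_obj, best_dist = obj_id, dist
        labels.append(best_obj)
    return labels


# Toy 2-d embedding space: one background word and two words for a "dog",
# standing in for two differently looking parts of the same object.
object_words = {
    "background": [[0.0, 0.0]],
    "dog": [[5.0, 5.0], [5.0, 0.0]],
}
query = [[0.2, 0.1], [4.8, 0.3], [5.1, 4.9]]
labels = segment(query, object_words)
print(labels)  # ['background', 'dog', 'dog']
```

Matching against several part-level words per object, instead of a single global descriptor, is what lets a pixel still be matched when only part of the object keeps its appearance.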

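Using clustering, meta-learning, and visual words in the embedding space, we develop a fast, causal algorithm that segments a variable number of objects in a single forward pass. It achieves state-of-the-art speed/accuracy trade-offs on four video segmentation datasets without requiring finetuning, auxiliary inputs, or post-processing.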