Recent video and language pretraining frameworks lack the ability to generate
sentences. We present Multimodal Video Generative Pretraining (MV-GPT), a new
pretraining framework for learning from unlabelled videos which can be
effectively used for generative tasks such as multimodal video captioning.
Unlike recent video-language pretraining frameworks, our framework trains both
a multimodal video encoder and a sentence decoder jointly. To overcome the lack
of captions in unlabelled videos, we leverage the future utterance as an
additional text source and propose a bidirectional generation objective -- we
generate future utterances given the present multimodal context, and also the
present utterance given future observations. With this objective, we train an
encoder-decoder model end-to-end to generate a caption from raw pixels and
transcribed speech directly. Our model achieves state-of-the-art performance
for multimodal video captioning on four standard benchmarks, as well as for
other video understanding tasks such as VideoQA, video retrieval and action
classification.
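The bidirectional generation objective described above can be illustrated by how training pairs might be assembled from consecutive clips. This is a minimal sketch, not the authors' pipeline: the function name `build_bidirectional_pairs`, the dictionary keys, and the choice to treat "future observations" as the next clip's frames plus its utterance are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def build_bidirectional_pairs(clips: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build forward and backward generation examples from time-ordered clips.

    Each clip is a (frames_id, utterance) tuple. The structure is hypothetical:
    a real pipeline would carry frame tensors and tokenized text instead.
    """
    examples = []
    for (frames_now, utt_now), (frames_next, utt_next) in zip(clips, clips[1:]):
        # Forward: present frames + present utterance -> future utterance.
        examples.append({
            "direction": "forward",
            "frames": frames_now,
            "input_text": utt_now,
            "target_text": utt_next,
        })
        # Backward (assumed form): future frames + future utterance -> present utterance.
        examples.append({
            "direction": "backward",
            "frames": frames_next,
            "input_text": utt_next,
            "target_text": utt_now,
        })
    return examples


if __name__ == "__main__":
    clips = [("clip0", "first we chop the onions"),
             ("clip1", "then we fry them in oil")]
    for example in build_bidirectional_pairs(clips):
        print(example["direction"], "->", example["target_text"])
```

Both directions give the decoder a text target even though no captions exist, which is the point the abstract makes about unlabelled videos.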

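Proposes Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework that uses future utterances in unlabelled videos as an additional text source and introduces a bidirectional generation objective, training an end-to-end model that generates multimodal video captions directly from raw pixels and transcribed speech.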

End-to-end Generative Pretraining for Multimodal Video Captioning

Single-view depth estimation using CNNs trained from unlabelled videos
has shown significant promise. However, excellent results have mostly been
obtained in street-scene driving scenarios, and such methods often fail in
other settings, particularly indoor videos taken by handheld devices. In this
work, we establish that the complex ego-motions exhibited in handheld settings
are a critical obstacle for learning depth. Our fundamental analysis suggests
that the rotation behaves as noise during training, as opposed to the
translation (baseline) which provides supervision signals. To address the
challenge, we propose a data pre-processing method that rectifies training
images by removing their relative rotations for effective learning. The
significantly improved performance validates our motivation. Towards end-to-end
learning without requiring pre-processing, we propose an Auto-Rectify Network
with novel loss functions, which can automatically learn to rectify images
during training. Consequently, our results outperform the previous unsupervised
SOTA method by a large margin on the challenging NYUv2 dataset. We also
demonstrate the generalization of our trained model on ScanNet and Make3D, and
the universality of our proposed learning method on 7-Scenes and KITTI
datasets.
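The idea of rectifying an image pair by removing its relative rotation can be sketched with the standard pure-rotation homography H = K R^T K^{-1}. The sketch below assumes that the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and the relative rotation `rotation` are already known; how the rotation is estimated is not stated in the abstract, and the function names are hypothetical.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose a 3x3 matrix; for a rotation matrix this is its inverse."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rectifying_homography(fx, fy, cx, cy, rotation):
    """Homography that undoes `rotation` in the pixels of the second frame.

    `rotation` maps frame-1 camera coordinates to frame-2 camera coordinates
    (X2 = R X1 when the translation is zero). Assumes a pinhole camera.
    """
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return matmul3(matmul3(k, transpose3(rotation)), k_inv)


def apply_homography(h, u, v):
    """Map pixel (u, v) through homography h and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w


if __name__ == "__main__":
    angle = 0.1  # radians, rotation about the camera's vertical axis
    c, s = math.cos(angle), math.sin(angle)
    rotation = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    h = rectifying_homography(500.0, 500.0, 320.0, 240.0, rotation)
    print(apply_homography(h, 320.0, 240.0))
```

After such a warp, the remaining motion between the two frames is translation only, which is the component the abstract identifies as providing the supervision signal.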

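Proposes a data pre-processing method that removes relative rotation, together with an Auto-Rectify Network, to address the interference of rotational motion with single-view depth estimation in handheld settings, and validates the effectiveness and generality of the approach on several datasets.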

Auto-Rectify Network for Unsupervised Indoor Depth Estimation

We propose KeypointGAN, a new method for recognizing the pose of objects from
a single image that learns using only unlabelled videos and a weak empirical
prior on the object poses. Video frames differ primarily in the pose
of the objects they contain, so our method distils the pose information by
analyzing the differences between frames. The distillation uses a new dual
representation of the geometry of objects as a set of 2D keypoints, and as a
pictorial representation, i.e. a skeleton image. This has three benefits: (1)
it provides a tight `geometric bottleneck' which disentangles pose from
appearance, (2) it can leverage powerful image-to-image translation networks to
map between photometry and geometry, and (3) it allows empirical pose priors to
be incorporated in the learning process. The pose priors are obtained from
unpaired data, such as a different dataset or another modality such as mocap,
so that no annotated image is ever used in learning the pose recognition
network. On standard benchmarks for pose recognition of humans and faces, our
method
achieves state-of-the-art performance among methods that do not require any
labelled images for training.
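The dual representation mentioned above, a set of 2D keypoints and a skeleton image drawn from them, can be illustrated with a small rasterizer. This is only a sketch under stated assumptions: the function name `render_skeleton`, the `bones` list of keypoint index pairs, and the hard binary rasterization are hypothetical choices, and a training pipeline would presumably need a differentiable rendering instead.

```python
from typing import List, Sequence, Tuple


def render_skeleton(keypoints: Sequence[Tuple[float, float]],
                    bones: Sequence[Tuple[int, int]],
                    height: int,
                    width: int,
                    samples: int = 64) -> List[List[float]]:
    """Draw line segments between connected keypoints on a blank image.

    `keypoints` are (x, y) pixel coordinates; `bones` lists index pairs into
    `keypoints`. Returns a height x width grid with 1.0 on skeleton pixels.
    """
    image = [[0.0] * width for _ in range(height)]
    for a, b in bones:
        (xa, ya), (xb, yb) = keypoints[a], keypoints[b]
        # Sample points along the segment and mark the nearest pixel.
        for step in range(samples + 1):
            t = step / samples
            x = round(xa + t * (xb - xa))
            y = round(ya + t * (yb - ya))
            if 0 <= x < width and 0 <= y < height:
                image[y][x] = 1.0
    return image


if __name__ == "__main__":
    # A toy three-joint chain; the joint layout is made up for illustration.
    joints = [(1, 1), (6, 1), (6, 4)]
    chain = [(0, 1), (1, 2)]
    for row in render_skeleton(joints, chain, height=6, width=8):
        print("".join("#" if value else "." for value in row))
```

Because the skeleton image contains only geometry, any appearance information has to pass through the keypoints, which is the "geometric bottleneck" the abstract refers to.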

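This paper proposes KeypointGAN, which learns from unlabelled videos and a weak empirical pose prior to recognize the pose of objects from a single image. It relies on a new dual representation of object geometry and achieves state-of-the-art results among methods that use no labelled images.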