Video captioning is process of summarising the content, event and action of
the video into a short textual form which can be helpful in many research areas
such as video guided machine translation, video sentiment analysis and
providing aid to needy individual. In this paper, a system description of the
framework used for VATEX-2020 video captioning challenge is presented. We
employ an encoder-decoder based approach in which the visual features of the
video are encoded using 3D convolutional neural network (C3D) and in the
decoding phase two Long Short Term Memory (LSTM) recurrent networks are used in
which visual features and input captions are fused separately and final output
is generated by performing element-wise product between the output of both
LSTMs. Our model is able to achieve BLEU scores of 0.20 and 0.22 on public and
private test data sets respectively.

本文介绍了用于视频字幕挑战的框架，采用编码器 - 解码器的方法，其中使用 3D 卷积神经网络对视频进行编码，并使用两个 LSTM 递归网络进行解码，最终输出是两个 LSTM 的输出元素乘积，而此模型可以在公共和私人测试数据集上实现 BLEU 得分分别为 0.20 和 0.22。

2020 VATEX 视频字幕挑战赛 NITS-VC 系统

NITS-VC System for VATEX Video Captioning Challenge 2020

We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
on a frame-by-frame basis, which are not applicable to many video analytic
tasks where spatio-temporal features are prevailing. In this paper we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzles that are even hard for humans to solve,
the proposed approach is consistent with human inherent visual habits and
therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
this https URL

在没有人工标注标签的前提下，本文提出了一种自我监督学习方法来学习视频的时空特征，通过回归时空维度上的外观和运动统计量来提取视觉特征，并在视频分类任务中验证了其有效性。

通过预测动态和外观统计信息进行视频自监督时空表示学习

Self-supervised Spatio-temporal Representation Learning for Videos by  Predicting Motion and Appearance Statistics

We propose a simple, yet effective approach for spatiotemporal feature
learning using deep 3-dimensional convolutional networks (3D ConvNets) trained
on a large scale supervised video dataset. Our findings are three-fold: 1) 3D
ConvNets are more suitable for spatiotemporal feature learning compared to 2D
ConvNets; 2) A homogeneous architecture with small 3x3x3 convolution kernels in
all layers is among the best performing architectures for 3D ConvNets; and 3)
Our learned features, namely C3D (Convolutional 3D), with a simple linear
classifier outperform state-of-the-art methods on 4 different benchmarks and
are comparable with current best methods on the other 2 benchmarks. In
addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset
with only 10 dimensions and also very efficient to compute due to the fast
inference of ConvNets. Finally, they are conceptually very simple and easy to
train and use.

通过在大规模监督视频数据集上使用训练的深度三维卷积神经网络（3D ConvNets）提出了一种简单而有效的时空特征学习方法。我们的成果有三个：1）相对于 2D ConvNets，3D ConvNets 更适用于时空特征学习；2）所有层中具有小的 3x3x3 卷积核的同构体系结构是 3D ConvNets 中表现最佳的体系结构之一；3）我们学到的特征 —— 即 C3D（卷积 3D）—— 连同一个简单的线性分类器，在 4 个不同的基准测试中优于最先进的方法，并与其他 2 个基准测试中的最佳方法相当。此外，这些特征紧凑：只需 10 维便能在 UCF101 数据集上达到 52.8％的准确率，由于 ConvNets 的快速推理，计算效率也非常高。最后，它们在概念上非常简单易用且易于训练和使用。