In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly drawn into training the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and leads to significant gains in video caption accuracy.

本文提出了一种重构网络（RecNet）的架构，该网络利用正反两个方向的流动来进行视频字幕生成，其编码器-解码器使用正向流产生编码视频语义特征的句子描述，两种类型的重构器则用于回溯流程并重新生成与解码器生成的隐藏状态序列基于的视频特征。实验结果表明，所提出的重构器网络能够提高编码器-解码器模型的性能，并显著提高视频字幕准确性。

视频字幕重构网络