We present our submission to the Microsoft Video to Language Challenge, whose task is to generate short captions describing the videos in the challenge dataset. Our model is based on the encoder--decoder pipeline popular in image and video captioning systems. We propose to use two different kinds of video features: one captures the video content in terms of objects and attributes, and the other captures motion and action information. Using these diverse features, we train models that specialize in two separate input sub-domains. We then train an evaluator model that picks the best caption from the pool of candidates generated by these domain-expert models. We argue that, because of the diversity in the dataset, this approach is better suited to the current video captioning task than a single model. The efficacy of our method is supported by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation. Additionally, it ranked second in the table based on automatic evaluation metrics.
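The candidate-pool selection step described above can be illustrated with a minimal sketch. It is not the submitted system: the names `pick_best_caption`, `toy_evaluator`, `content_expert`, and `motion_expert` are hypothetical, and the toy scoring rule (word overlap with a set of tags standing in for video features) is an assumption that replaces the trained evaluator model.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical interface: an evaluator scores how well a caption matches
# a video representation; a higher score means a better match.
Evaluator = Callable[[Set[str], str], float]


def pick_best_caption(
    video_tags: Set[str],
    candidates: Dict[str, str],
    evaluator: Evaluator,
) -> Tuple[str, str]:
    """Return (model_name, caption) for the highest-scoring candidate."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    best_model = max(
        candidates, key=lambda name: evaluator(video_tags, candidates[name])
    )
    return best_model, candidates[best_model]


def toy_evaluator(video_tags: Set[str], caption: str) -> float:
    # Stand-in for the trained evaluator (assumption): count how many
    # caption words also appear among the video's tags.
    return float(len(set(caption.lower().split()) & video_tags))


# One candidate per hypothetical domain-expert captioning model.
pool = {
    "content_expert": "a man is cooking food",
    "motion_expert": "a person is running",
}
best = pick_best_caption({"man", "cooking", "kitchen"}, pool, toy_evaluator)
```

In this sketch each expert contributes one caption and the evaluator is applied to every candidate independently, so adding further expert models only enlarges the pool and leaves the selection logic unchanged.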

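This work builds on the encoder-decoder architecture. Using different video features, it trains two models, each responsible for one input sub-domain (object information and motion information, respectively), and uses an evaluator model to select the best video caption from the candidate sentences generated by these expert models. Compared with a single model, this approach is better suited to the video captioning task; it received the best human-evaluation rating in the MSR Video to Language Challenge and ranked second on the automatic evaluation metrics.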

Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation