It is encouraging to see that progress has been made to bridge videos and
natural language. However, mainstream video captioning methods suffer from slow
inference speed due to the sequential manner of autoregressive decoding, and
tend to generate generic descriptions due to the insuff
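This study addresses the decoding problem in video captioning models and improves model performance through three techniques: using variational dropout and layer normalization to mitigate overfitting, proposing an online method for evaluating model performance in order to select the best checkpoint for testing, and proposing a new training strategy called professional learning. Experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets show that, compared with previous state-of-the-art models, our model achieves significant gains on the BLEU, CIDEr, METEOR, and ROUGE-L metrics, with improvements of up to 18% on MSVD and 3.5% on MSR-VTT.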