Automatically generating a natural language sentence to describe the content
of an input video is a very challenging problem. It is essentially a multimodal
task in which auditory and visual content are equally important. Although
audio information has been exploited to improve video captioni