We propose a new task, dataset and model for grounded video caption generation. This task unifies captioning and object grounding in video: the objects mentioned in the caption are localized in the video via temporally consistent bounding boxes. We make the following contributions. Fir