As a fine-grained video understanding task, dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers and proposes variable-length temporal events based on pooled features. In order to explicitly model temporal relationships between visual events and their captions in a single video, we propose a two-level hierarchical LSTM module that transcribes the event proposals into captions. Unlike existing dense video captioning approaches, our proposal generation and language captioning networks are trained end-to-end, allowing for improved temporal segmentation. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by most language generation metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.

JEDDi-Net是一种用于密集视频字幕生成的神经网络，它通过三维卷积层对输入视频流进行连续编码，并使用时间池化特征提出可变长度的时间事件，再生成它们的字幕。在大规模数据集上，JEDDi-Net 表现出了优异的性能。

连续视频流中的事件检测和描述