Automatic video description requires the generation of natural language
statements about the actions, events, and objects in the video. An important
human trait, when we describe a video, is that we are able to do this with
variable levels of detail. In contrast, existing approaches for
automatic video description mostly focus on single-sentence generation
at a fixed level of detail. Here we instead address video description of
manipulation actions, where different levels of detail are required to
convey the hierarchical structure of these actions, which is also
relevant for modern approaches to robot learning. We propose one hybrid
statistical and one end-to-end framework to address this problem. The hybrid
method needs much less training data because it statistically models
uncertainties within the video clips, while the end-to-end method, which is
more data-heavy, directly connects the visual encoder to the language
decoder without any intermediate (statistical) processing step. Both frameworks
use LSTM stacks to allow for different levels of description granularity, so
videos can be described by simple single sentences or complex multi-sentence
descriptions. In addition, quantitative results demonstrate that these methods
produce more realistic descriptions than other competing approaches.

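Proposes one hybrid statistical framework and one end-to-end framework to address levels of detail, manipulation actions, and hierarchical structure in video description; quantitative results show that these methods produce more realistic descriptions than competing approaches.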

Multi Sentence Description of Complex Manipulation Action Videos

Automatically describing videos has long been fascinating. In this work, we
attempt to describe videos from a specific domain: broadcast videos of lawn
tennis matches. Given a video shot from a tennis match, we intend to generate a
textual commentary similar to what a human expert would write on a sports
website. Unlike many recent works that focus on generating short captions, we
are interested in generating semantically richer descriptions. This demands a
detailed low-level analysis of the video content, especially the actions and
interactions among subjects. We address this by limiting our domain to the game
of lawn tennis. Rich descriptions are generated by leveraging a large corpus of
human-created descriptions harvested from the Internet. We evaluate our method on a
newly created tennis video data set. Extensive analysis demonstrates that our
approach addresses both the semantic correctness and the readability aspects
involved in the task.

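Targeting broadcast videos of lawn tennis matches, this paper generates semantically rich descriptions by leveraging human-created descriptions harvested from the Internet; the output resembles the textual commentary a human expert would write for a sports website, and evaluation shows it addresses both semantic correctness and readability.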

TennisVid2Text: Fine-grained Descriptions for Domain Specific Videos

Humans can easily describe what they see in a coherent way and at varying
levels of detail. However, existing approaches for automatic video description
are mainly focused on single sentence generation and produce descriptions at a
fixed level of detail. In this paper, we address both of these limitations: we
produce coherent multi-sentence descriptions of complex videos at a variable
level of detail. We follow a two-step approach where we first learn to predict a
semantic representation (SR) from video and then generate natural language
descriptions from the SR. To produce consistent multi-sentence descriptions, we
model across-sentence consistency at the level of the SR by enforcing a
consistent topic. We also contribute to both the visual recognition of objects,
by proposing a hand-centric approach, and the robust generation of
sentences, using a word lattice. Human judges rate our multi-sentence
descriptions as more readable, correct, and relevant than related work. To
understand the difference between more detailed and shorter descriptions, we
collect and analyze a video description corpus with three levels of detail.

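This paper presents a method that learns to predict a semantic representation (SR) from video and generates coherent multi-sentence natural language descriptions from the SR. It also proposes a hand-centric visual recognition approach and word-lattice-based sentence generation, and human evaluation shows that the resulting descriptions are more readable, correct, and relevant than those of related work.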
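
The idea of enforcing a consistent topic at the SR level can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the topic labels, the SR tuples, and the log-scores below are all hypothetical, and the sketch assumes every video segment provides a scored SR candidate for every topic. It only shows why choosing one topic jointly for all segments can differ from picking the best SR per segment independently.

```python
# Toy illustration (hypothetical data and names, not the paper's model).
# Each segment maps a topic label to (SR tuple, log-score).
segments = [
    {"making toast": (("take out", "bread"), -0.4),
     "making salad": (("take out", "cucumber"), -0.3)},
    {"making toast": (("put in", "toaster"), -0.2),
     "making salad": (("cut", "cucumber"), -1.5)},
]


def choose_topic(segments):
    """Return the topic with the highest summed log-score over all segments.

    Assumes every segment has an entry for every topic.
    """
    topics = sorted(segments[0].keys())  # sorted for deterministic tie-breaking
    return max(topics, key=lambda t: sum(seg[t][1] for seg in segments))


def consistent_srs(segments):
    """Pick one topic for the whole video, then read off that topic's SR per segment."""
    topic = choose_topic(segments)
    return topic, [seg[topic][0] for seg in segments]


def independent_topics(segments):
    """Baseline: pick the best-scoring topic in each segment separately."""
    return [max(sorted(seg), key=lambda t: seg[t][1]) for seg in segments]


topic, srs = consistent_srs(segments)
per_segment = independent_topics(segments)
```

In this made-up example the independent baseline switches topic between the two segments, while the joint choice keeps a single topic for the whole video, so the sentences generated from the SRs would describe one coherent activity.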