Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
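As a rough illustration of the 'who' component, the sketch below shows one way a character bank could be queried to propose names for a clip. It is not the paper's implementation: the `CharacterEntry` type, the `propose_characters` function, the cosine-similarity matching rule, and the `threshold` value are all assumptions made for this example, and short toy vectors stand in for real CLIP features.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class CharacterEntry:
    """One character-bank entry (hypothetical layout for this sketch)."""
    name: str            # character name from the cast list
    actor: str           # actor who played the part
    face_feature: list   # stand-in for a CLIP feature of the actor's face


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_characters(frame_features, bank, threshold=0.8):
    """Return names of bank characters whose face feature matches any frame.

    Assumption: a character counts as present if its best cosine similarity
    to any frame feature reaches `threshold`. Names are returned best match first.
    """
    proposals = []
    for entry in bank:
        best = max(
            (cosine(entry.face_feature, f) for f in frame_features),
            default=0.0,
        )
        if best >= threshold:
            proposals.append((entry.name, best))
    proposals.sort(key=lambda p: p[1], reverse=True)
    return [name for name, _ in proposals]


# Toy usage with made-up names and 3-dimensional vectors.
bank = [
    CharacterEntry("Alice", "Actor One", [1.0, 0.0, 0.0]),
    CharacterEntry("Bob", "Actor Two", [0.0, 1.0, 0.0]),
]
frames = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.2]]
print(propose_characters(frames, bank))
```

In a full system, the proposed names would be passed to the text generator alongside the visual features; the thresholded matching here is only a placeholder for whatever selection mechanism the model actually uses.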

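To address the challenges of movie audio description (AD), namely fitting AD into existing pauses in dialogue, referring to characters by name, and aiding understanding of the storyline as a whole, we develop a new model that automatically generates movie AD from CLIP visual features of the frames, the cast list, and the temporal locations of the speech. The model addresses three questions: 'who', by introducing a character bank that improves naming; 'when', by investigating several models that decide, from the visual content of a time interval and its neighbours, whether an AD should be generated; and 'what', by implementing a new vision-language model that conditions on the visual features using cross-attention and improves over previous architectures for AD text generation.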

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description