With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high-quality dataset, we developed a series of novel quality controls to validate free-form text input from crowdworkers. We show that there is an unambiguous association between visual content and natural language descriptions in our dataset, making it an ideal benchmark for the visual content captioning task. We perform extensive statistical analyses to compare our dataset to existing image and video description datasets. Next, we provide baseline results on the animated GIF description task, using three representative techniques: nearest neighbor, statistical machine translation, and recurrent neural networks. Finally, we show that models fine-tuned on our animated GIF description dataset can be helpful for automatic movie description.

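Summary: This work collects 100K animated GIFs and obtains 120K natural language descriptions via crowdsourcing, with the aim of advancing research on animated GIF understanding and natural language description generation. The dataset is presented as an ideal benchmark for evaluating the visual content captioning task. The work also provides baseline results on the animated GIF description task using nearest neighbor, statistical machine translation, and recurrent neural networks, and shows that models fine-tuned on this dataset are helpful for automatic movie description.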

TGIF: A New Dataset and Benchmark on Animated GIF Description