Sequential vision-to-language, or visual storytelling, has recently been an
area of focus in the computer vision and language modeling domains. Though
existing models generate narratives that read subjectively well, there can be
cases where these models fail to generate stories that account for and address
all prospective human and animal characters in the image sequences. Considering
this scenario, we propose a model that implicitly learns relationships between
provided characters and thereby generates stories with the respective
characters in scope. We use the VIST dataset for this purpose and report
numerous statistics on the dataset. Finally, we describe the model, explain the
experiment, and discuss our current status and future work.

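This work uses the VIST dataset and proposes a model that implicitly learns relationships between the provided characters and generates stories that keep those characters in scope. It addresses the problem that models generating stories from image sequences may overlook the human and animal characters present.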

Character-Centric Storytelling

We introduce the first dataset for sequential vision-to-language, and explore
how this data may be used for the task of visual storytelling. The first
release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211
sequences, aligned to both descriptive (caption) and story language. We
establish several strong baselines for the storytelling task, and motivate an
automatic metric to benchmark progress. Modelling concrete description as well
as figurative and social language, as provided in this dataset and the
storytelling task, has the potential to move artificial intelligence from basic
understandings of typical visual scenes towards more and more human-like
understanding of grounded event structure and subjective expression.

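This work releases the first dataset for sequential vision-to-language, containing 81,743 unique photos in 20,211 sequences, and explores its use for the visual storytelling task. It establishes several strong baselines and motivates an automatic metric for benchmarking progress. By supporting the modelling of concrete description as well as figurative and social language, the dataset could help move artificial intelligence toward a more human-like understanding of grounded event structure and subjective expression.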