With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.

使用多模态大型语言模型（MLLM）提出了SEED-Story，一种新颖的方法，用于生成扩展的多模态故事。模型基于MLLM的强大理解能力，预测文本和视觉标记，并通过适应的视觉解标记器处理视觉标记以生成具有一致的字符和风格的图像。还提出了多模态注意力池机制，以高效的自回归方式生成高达25个序列（仅使用10个进行训练）的故事。此外，还提供了一种名为StoryStream的大规模高分辨率数据集，用于训练模型并在各个方面定量评估多模态故事生成任务。

SEED-Story：利用大型语言模型进行多模式长篇故事生成