In this work, we study the problem of generating novel images from complex
multimodal prompt sequences. While existing methods achieve promising results
for text-to-image generation, they often struggle to capture fine-grained
details from lengthy prompts and maintain contextual coherence within prompt
sequences. Moreover, they often produce misaligned images for prompt
sequences featuring multiple objects. To address these issues, we propose a
Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that
generates novel images from complex multimodal prompt sequences by leveraging
the combined capabilities of large language models (LLMs) and diffusion models.
Our MGCC comprises a novel Cross-Modal Refinement module to explicitly learn
cross-modal dependencies between the text and image in the LLM embedding space,
and a contextual object grounding module to generate object bounding boxes
specifically targeting scenes with multiple objects. Our MGCC demonstrates a
diverse range of multimodal capabilities, such as novel image generation,
multimodal dialogue, and text generation. Experimental evaluations on two
benchmark datasets demonstrate the effectiveness of our method. On the Visual
Story Generation (VIST) dataset with multimodal inputs, our MGCC achieves a CLIP
Similarity score of $0.652$, compared to $0.641$ for the SOTA method GILL.
Similarly, on the Visual Dialogue Context (VisDial) dataset, which has lengthy
dialogue sequences, our MGCC achieves a CLIP score of $0.660$, outperforming
the existing SOTA method, which scores $0.645$. Code:
this https URL
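
For context on the reported metric: a CLIP similarity score is commonly computed as the cosine similarity between the CLIP embedding of a generated image and that of a reference image, averaged over a dataset. The abstract does not spell out the exact evaluation protocol, so the following is only a minimal sketch under that assumption. The function names `cosine_similarity` and `mean_clip_similarity` are illustrative, and the vectors are made-up toy values, not real CLIP outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_clip_similarity(generated, references):
    """Average pairwise similarity over aligned lists of embeddings.

    Assumption: generated[i] and references[i] are embeddings of the
    generated image and the reference image for the same prompt.
    """
    if len(generated) != len(references) or not generated:
        raise ValueError("need two non-empty lists of equal length")
    scores = [cosine_similarity(g, r) for g, r in zip(generated, references)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real CLIP embeddings have hundreds of dimensions.
    generated = [[0.6, 0.8], [1.0, 0.0]]
    references = [[0.8, 0.6], [1.0, 0.0]]
    print(mean_clip_similarity(generated, references))
```

Because the score is a cosine similarity, it is invariant to the scale of the embeddings, which is why the inputs need not be normalized beforehand.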

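This work proposes MGCC, a multimodal generation method that leverages large language models and diffusion models. By explicitly learning cross-modal dependencies between text and images in the LLM embedding space, and by generating object bounding boxes specifically for multi-object scenes, it generates novel images from complex multimodal prompt sequences. The method is validated experimentally on two benchmark datasets.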

Multi-modal Generation via Cross-Modal In-Context Learning

In recent years, audio-driven 3D facial animation has gained significant
attention, particularly in applications such as virtual reality, gaming, and
video conferencing. However, accurately modeling the intricate and subtle
dynamics of facial expressions remains a challenge. Most existing studies
approach the facial animation task as a single regression problem, which often
fails to capture the intrinsic inter-modal relationship between speech signals
and 3D facial animation and overlooks their inherent consistency. Moreover, due
to the limited availability of 3D audio-visual datasets, approaches trained on
small samples generalize poorly, which degrades performance. To address these
issues, we propose a cross-modal dual-learning framework, termed DualTalker,
that aims to improve data usage efficiency and to relate cross-modal
dependencies. The framework is trained jointly on the primary task
(audio-driven facial animation) and its dual task (lip reading), with the two
tasks sharing common audio/motion encoder components. Our
joint training framework facilitates more efficient data usage by leveraging
information from both tasks and explicitly capitalizing on the complementary
relationship between facial motion and audio to improve performance.
Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate
the potential over-smoothing underlying the cross-modal complementary
representations, enhancing the mapping of subtle facial expression dynamics.
Through extensive experiments and a perceptual user study conducted on the VOCA
and BIWI datasets, we demonstrate that our approach outperforms current
state-of-the-art methods both qualitatively and quantitatively. We have made
our code and video demonstrations available at
this https URL
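
The joint training described above can be pictured as minimizing a weighted sum of a primary-task loss, a dual-task loss, and the auxiliary consistency term. The abstract does not give the exact loss forms or weights, so the sketch below is purely illustrative: `joint_objective`, `dual_weight`, and `consistency_weight` are hypothetical names and values, and using a mean squared error between audio and motion features as the consistency term is an assumption, not the authors' stated formulation.

```python
def mean_squared_error(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_objective(animation_loss, lip_reading_loss, consistency_loss,
                    dual_weight=1.0, consistency_weight=0.1):
    """Weighted sum of the three loss terms.

    The weights are hypothetical defaults chosen for illustration only.
    """
    return (animation_loss
            + dual_weight * lip_reading_loss
            + consistency_weight * consistency_loss)


if __name__ == "__main__":
    # Toy values standing in for model outputs and targets.
    predicted_motion = [0.1, 0.4, 0.3]
    target_motion = [0.0, 0.5, 0.3]
    animation_loss = mean_squared_error(predicted_motion, target_motion)

    # Placeholder scalar for the lip-reading (dual task) loss.
    lip_reading_loss = 0.8

    # Assumed consistency term: distance between audio and motion features.
    audio_features = [0.2, 0.9]
    motion_features = [0.3, 0.7]
    consistency_loss = mean_squared_error(audio_features, motion_features)

    print(joint_objective(animation_loss, lip_reading_loss, consistency_loss))
```

In a setup like this, the shared encoders receive gradients from all three terms, which is one way joint training can make fuller use of a small dataset.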

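Through a cross-modal dual-learning framework and an auxiliary cross-modal consistency loss, the method improves data usage efficiency, relates cross-modal dependencies, and enhances the mapping of subtle facial expression dynamics, thereby improving performance in speech-driven 3D facial animation.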