Generating instructional images of human daily actions from an egocentric
viewpoint is a key step towards efficient skill transfer. In this paper, we
introduce a novel problem -- egocentric action frame generation. The goal is to
synthesize an action frame conditioned on a user prompt question and an
input egocentric image that captures the user's environment. Notably, existing
egocentric datasets lack the detailed annotations that describe the execution
of actions. Additionally, diffusion-based image manipulation models fail to
control the state change of an action within the corresponding egocentric image
pixel space. To this end, we finetune a visual large language model (VLLM) via
visual instruction tuning to curate enriched action descriptions for
our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO)
action frame generation using image and text embeddings from the VLLM as additional
conditioning. We validate our proposed model on two egocentric datasets --
Ego4D and Epic-Kitchens. Our experiments show notable improvements over prior
image manipulation models in both quantitative and qualitative evaluation. We
also conduct detailed ablation studies and analysis to provide insights into our
method.
