We introduce ReimaginedAct, a novel text-to-pose video editing method. While existing video editing tasks are limited to changes in attributes, backgrounds, and styles, our method aims to predict open-ended human action changes in video. Moreover, it accepts not only direct instructional text prompts but also "what if" questions to predict possible action changes. ReimaginedAct comprises video understanding, reasoning, and editing modules. First, an LLM produces a plausible answer to the instruction or question, which is then used for (1) prompting Grounded-SAM to produce bounding boxes of the relevant individuals and (2) retrieving pose videos from a set that we have collected for editing human actions. The retrieved pose videos and the detected individuals are then used to alter the poses extracted from the original video. We also employ a timestep blending module to ensure that the edited video retains its original content except where modifications are necessary. To facilitate research in text-to-pose video editing, we introduce a new evaluation dataset, WhatifVideo-1.0. This dataset includes videos of different scenarios spanning a range of difficulty levels, along with questions and text prompts. Experimental results demonstrate that existing video editing methods struggle with human action editing, while our approach achieves effective action editing and even imaginary editing from counterfactual questions.
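
The abstract does not describe how the timestep blending module works internally. As a rough illustration of the general idea of keeping original content outside the edited region, the sketch below blends an edited frame into an original frame using a mask built from a bounding box. Everything here is an assumption made for illustration: the `Box` format, the helper names `box_mask` and `blend`, and the use of a hard box mask on single-channel frames are hypothetical and are not taken from ReimaginedAct.

```python
from typing import List, Tuple

# Hypothetical types, for illustration only.
Frame = List[List[float]]        # single-channel frame as rows of values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open on x1 and y1


def box_mask(height: int, width: int, box: Box) -> Frame:
    """Return a mask that is 1.0 inside the box and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [
        [1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0 for x in range(width)]
        for y in range(height)
    ]


def blend(original: Frame, edited: Frame, mask: Frame) -> Frame:
    """Take edited values where the mask is 1 and original values where it is 0."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(original, edited, mask)
    ]


# Toy usage: a 3x3 "original" of zeros and an "edited" frame of ones.
original = [[0.0] * 3 for _ in range(3)]
edited = [[1.0] * 3 for _ in range(3)]
mask = box_mask(3, 3, (1, 1, 3, 3))
result = blend(original, edited, mask)
```

In a diffusion-based editor, a blend of this kind would presumably be applied to latents at selected denoising timesteps rather than to pixels once; that reading is also an assumption, not something the abstract states.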

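We introduce ReimaginedAct, a new text-to-action video editing method that predicts human action changes in video. It accepts not only direct instructional text prompts but also hypothetical questions for predicting possible action changes. The method comprises video understanding, reasoning, and editing modules, and we introduce a new evaluation dataset, WhatifVideo-1.0. Experiments show that, compared with existing video editing methods, our method achieves effective action editing and can even perform imaginary editing from hypothetical questions.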

Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions