Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter on in-the-wild samples, primarily because of the distribution gap between the training datasets and real-world test samples. While some researchers aim to improve generalizability through sophisticated training procedures, advanced architectures, or more diverse datasets, we instead adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. Naively applying test-time tuning, however, produces inconsistencies in facial identity and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining face, text, and image embeddings. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results and offers greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds on an NVIDIA V100 GPU.

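This work addresses the poor performance of existing pose-guided person image synthesis methods on in-the-wild samples, particularly when labeled triplet data is scarce. We propose a new method, OnePoseTrans, which introduces a Visual Consistency Module (VCM) that combines face, text, and image embeddings to produce high-quality pose transfer results from only a single source image. The study shows that the method has a significant advantage in preserving appearance consistency, and that customizing a model takes about 48 seconds.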

One-Shot Learning for Pose-Guided Person Image Synthesis