Previous approaches to disentangled image manipulation depend heavily on
manual annotation, and the manipulations they support are limited to the
pre-defined set the models were trained on. We propose a novel framework,
i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image
manipulation that requires little manual annotation while being applicable to a
wide variety of manipulations. Our method achieves both goals by exploiting
the large-scale pre-trained vision-language model CLIP.
Concretely, we first Predict the possibly entangled attributes for a given
text command. Then, based on the predicted attributes, we introduce an
entanglement loss to Prevent entanglements during training. Finally, we propose
a new evaluation metric to Evaluate the disentangled image manipulation. We
verify the effectiveness of our method on the challenging face editing task.
Extensive experiments show that the proposed PPE framework achieves much better
quantitative and qualitative results than the recent StyleCLIP baseline.
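
The Predict and Prevent steps can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not specify how attributes are predicted or how the entanglement loss is defined, so the cosine-similarity ranking, the loss form, and every name below (`predict_entangled`, `entanglement_loss`, `top_k`) are assumptions made for illustration. Plain Python lists stand in for CLIP text and image embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_entangled(command_emb, attribute_embs, top_k=2):
    """Predict step (assumed form): rank candidate attributes by how similar
    their text embedding is to the command embedding and keep the top_k."""
    ranked = sorted(
        attribute_embs.items(),
        key=lambda item: cosine(command_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


def entanglement_loss(src_img_emb, edited_img_emb, attribute_embs, entangled):
    """Prevent step (assumed form): penalize any change in image-to-attribute
    similarity for attributes that the edit is not supposed to alter."""
    if not entangled:
        return 0.0
    total = 0.0
    for name in entangled:
        before = cosine(src_img_emb, attribute_embs[name])
        after = cosine(edited_img_emb, attribute_embs[name])
        total += abs(after - before)
    return total / len(entangled)


# Toy embeddings standing in for CLIP outputs (hypothetical values).
attributes = {
    "lipstick": [0.9, 0.1, 0.0],
    "glasses": [0.0, 1.0, 0.0],
}
command = [1.0, 0.0, 0.0]
likely_entangled = predict_entangled(command, attributes, top_k=1)
```

In this sketch the loss is zero when the edited image's similarity to each predicted attribute matches the source image's, and it grows as the edit drifts along those attributes. In a training setup it would be added to the main manipulation objective with a weighting factor.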

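This work proposes a new text-driven image manipulation framework that requires little manual annotation. Built on the large-scale pre-trained vision-language model CLIP, it achieves disentangled manipulation by predicting entangled attributes, introducing an entanglement loss, and proposing a new evaluation metric. On the challenging face editing task, it obtains better quantitative and qualitative results than the existing StyleCLIP baseline.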