Image style transfer occupies an important place in both computer graphics and computer vision. However, most current methods require a reference style image and cannot stylize specific objects individually. To overcome this limitation, we propose the "Soulstyler" framework, which lets users guide the stylization of specific objects in an image through simple textual descriptions. We use a large language model to parse the text and identify the stylization target and the desired style. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that ensures style transfer is performed only on the specified target objects, while non-target regions keep their original style. Experimental results demonstrate that our model accurately performs style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.
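
To make the idea of a localized text-image block matching loss concrete, the following is a minimal sketch and not the released Soulstyler implementation. It assumes a binary mask of the target object is already available, that image patches are scored with a directional cosine loss of the kind used in CLIP-guided stylization, and that only patches mostly inside the mask contribute. The names `localized_patch_loss`, `mask_coverage`, `embed_styled`, `embed_content`, and `min_coverage` are hypothetical; the two `embed_*` callables stand in for a CLIP image encoder applied to a patch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def mask_coverage(mask, top, left, size):
    """Fraction of a size x size patch that lies on the binary target mask."""
    covered = sum(
        mask[r][c]
        for r in range(top, top + size)
        for c in range(left, left + size)
    )
    return covered / (size * size)


def localized_patch_loss(patch_boxes, mask, embed_styled, embed_content,
                         text_direction, min_coverage=0.5):
    """Average directional loss over patches that mostly overlap the mask.

    patch_boxes: iterable of (top, left, size) tuples.
    embed_styled / embed_content: hypothetical stand-ins for a CLIP image
        encoder applied to the stylized and original patch respectively.
    text_direction: embedding difference between the style text and a
        neutral source text (assumed to be precomputed).
    """
    losses = []
    for top, left, size in patch_boxes:
        # Skip patches outside the target object so the background
        # receives no stylization signal from this term.
        if mask_coverage(mask, top, left, size) < min_coverage:
            continue
        styled = embed_styled(top, left, size)
        content = embed_content(top, left, size)
        image_direction = [s - c for s, c in zip(styled, content)]
        losses.append(1.0 - cosine(image_direction, text_direction))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch the mask only gates which patches are scored; a full system would presumably pair it with a content-preservation term for the non-target regions.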

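Through simple textual descriptions, the "Soulstyler" framework we propose lets users guide the stylization of specific objects in an image. We use a large language model to parse the text and identify the stylization target and the specific style, and combine it with a CLIP-based semantic visual embedding encoder so that the model can understand and match text and image content. We also introduce a novel localized text-image block matching loss, which ensures that style transfer is applied only to the specified target objects while non-target regions keep their original style. Experimental results show that our model accurately performs style transfer on target objects according to the textual description without affecting the style of background regions.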

Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object