Style transfer driven by text prompts has paved a new path for creatively stylizing images without collecting an actual style image. Despite promising results, text-driven stylization gives the user no control over how the stylization is applied. A user who wants to create an artistic image needs fine control over the stylization of the individual entities in the content image, which current state-of-the-art approaches do not address. Diffusion-based style transfer methods suffer from the same issue, because their regional control over the stylized output is ineffective. To address this problem, we propose a new method, Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), which applies styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization modules, both built on the vision transformer architecture, are used to segment and stylize the objects. Our method extends to arbitrary objects and styles and produces higher-quality images than current state-of-the-art methods. To our knowledge, this is the first attempt to perform text-guided arbitrary object-wise stylization. We demonstrate the effectiveness of our approach through qualitative and quantitative analysis, showing that it can generate visually appealing stylized images with enhanced control over stylization and that it generalizes to unseen object classes.
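The abstract does not say how the per-object stylizations are merged into one output image. As a minimal illustrative sketch, and not the authors' implementation, the idea of object-wise stylization can be pictured as mask-weighted blending: assume a segmentation module has already produced one soft mask per object named in the prompt, and a stylization module has produced one stylized image per object. The function name `composite_by_mask`, the tiny grayscale list-of-lists image format, and the "dog"/"sky" example objects are all hypothetical.

```python
from typing import Dict, List

# Hypothetical image format: a small grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def composite_by_mask(
    content: Image,
    stylized: Dict[str, Image],
    masks: Dict[str, Image],
) -> Image:
    """Blend each object's stylized image into the content image.

    For every object, a mask value of 1.0 takes the stylized pixel,
    0.0 keeps the current pixel, and values in between mix the two.
    Objects are applied in the iteration order of `masks`.
    """
    height, width = len(content), len(content[0])
    out = [row[:] for row in content]  # copy so the input stays unchanged
    for name, mask in masks.items():
        styled = stylized[name]
        for y in range(height):
            for x in range(width):
                m = mask[y][x]
                out[y][x] = m * styled[y][x] + (1.0 - m) * out[y][x]
    return out


# Toy 2x2 example with two hypothetical objects from a prompt.
content = [[0.0, 0.0], [0.0, 0.0]]
masks = {
    "dog": [[1.0, 0.0], [0.0, 0.0]],
    "sky": [[0.0, 1.0], [0.0, 0.5]],
}
stylized = {
    "dog": [[1.0, 1.0], [1.0, 1.0]],
    "sky": [[0.5, 0.5], [0.5, 0.5]],
}
result = composite_by_mask(content, stylized, masks)
print(result)  # [[1.0, 0.5], [0.0, 0.25]]
```

Pixels outside every mask keep their original content value, which is what lets each object receive its own style while the rest of the image stays untouched.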

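Style transfer driven by text prompts has opened a new path for creatively stylizing images. However, current state-of-the-art methods do not meet the user's need for fine-grained control over stylization or for control over regional stylization. To address this, we propose a new method, MOSAIC, which applies styles to different objects in an image based on the context extracted from the input prompt. Using text-based segmentation and stylization modules built on the vision transformer architecture, our method extends to arbitrary objects and styles and produces higher-quality images than current state-of-the-art methods. Qualitative and quantitative analysis validates its effectiveness and shows that it generates visually appealing stylized images, offers enhanced control over stylization, and generalizes to unseen object classes.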

MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP