We present a method to generate stylized 3D objects. Our method takes a text
prompt and a style reference image as input and reconstructs a neural radiance
field to synthesize a 3D model with the content aligning with the text prompt
and the style following the reference image. To simultaneously generate the 3D
object and perform style transfer in one go, we propose a stylized score
distillation loss to guide a text-to-3D optimization process to output visually
plausible geometry and appearance. Our stylized score distillation is based on
a combination of an original pretrained text-to-image model and its modified
sibling with the key and value features of self-attention layers manipulated to
inject styles from the reference image. Comparisons with state-of-the-art
methods demonstrate the strong visual performance of our method, which is
further supported by the quantitative results from our user study.
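
The two ingredients named above, key/value injection in self-attention and a score that combines the original and the modified model, can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names, the toy list-of-vectors representation, and in particular the linear blend with weight `lam` are assumptions, since the abstract only says that the two models are combined.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def style_injected_attention(content_q, style_k, style_v):
    """Keep the queries of the generated view, but read keys and values
    from the style reference (hypothetical helper, not the authors' code)."""
    return attention(content_q, style_k, style_v)


def stylized_score(eps_original, eps_styled, eps_noise, lam):
    """Blend two noise predictions and subtract the injected noise.

    ASSUMPTION: a linear blend with weight `lam`; the abstract does not
    state how the two models are combined.
    """
    return [
        (1.0 - lam) * o + lam * s - n
        for o, s, n in zip(eps_original, eps_styled, eps_noise)
    ]
```

With `lam = 0` the sketch reduces to an ordinary score distillation residual from the original model alone, and larger values of `lam` pull the update toward the style-injected prediction.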

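Given a text prompt and a style reference image as input, we propose a method for generating stylized 3D objects. It reconstructs a neural radiance field to synthesize a 3D model whose content matches the text prompt and whose style follows the reference image, and it uses a stylized score distillation loss to guide the text-to-3D optimization toward visually plausible geometry and appearance. Comparisons with existing methods show that our method performs strongly in visual quality, which is further supported by the quantitative results of a user study.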

Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation

Visual Geo-localization (VG) is the task of identifying the location depicted
in a query image. It is widely applied in robotics and computer vision tasks
such as autonomous driving, the metaverse, augmented reality, and SLAM. For
fine-grained images that lack specific text descriptions, directly applying
purely visual methods to represent neighborhood features often leads the model
to focus on overly fine-grained features and fail to fully mine the semantic
information in the images. Therefore, we propose a two-stage
training method to enhance visual performance and use contrastive learning to
mine challenging samples. We first leverage the multi-modal description
capability of CLIP (Contrastive Language-Image Pretraining) to create a set of
learnable text prompts for each geographic image feature to form vague
descriptions. Then, by utilizing dynamic text prompts to assist the training of
the image encoder, we enable the image encoder to learn better and more
generalizable visual features. This strategy of applying text to purely visual
tasks addresses the challenge of using multi-modal models for geographic
images, which often suffer from a lack of precise descriptions, making them
difficult to utilize widely. We validate the effectiveness of the proposed
strategy on several large-scale visual geo-localization datasets, where our
method achieves competitive results. Our code and model are available at
this https URL
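
The image-text contrastive objective mentioned above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this assumes the symmetric InfoNCE form popularized by CLIP; the function names, the temperature value, and the plain-list feature representation are illustrative assumptions, and the learnable prompts are represented only by the text features they would produce.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits`."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_sum


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss; pair k is (image k, text k).

    ASSUMPTION: CLIP-style InfoNCE; the paper's actual objective may differ.
    """
    images = [normalize(v) for v in image_feats]
    texts = [normalize(v) for v in text_feats]
    n = len(images)
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should pick out its own text.
    image_to_text = -sum(log_softmax_at(sim[k], k) for k in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    text_to_image = -sum(
        log_softmax_at([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

In this sketch the loss is near zero when each image feature matches its own prompt feature and grows when the pairing is swapped, which is the signal that would push an image encoder toward features consistent with its text prompts.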

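This work uses CLIP and contrastive learning to improve visual performance in visual geo-localization and to address the challenges of applying multi-modal models to geographic images.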

ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization

Most existing video face super-resolution (VFSR) methods are trained and
evaluated on VoxCeleb1, which was designed specifically for speaker
identification and whose frames are of low quality. As a consequence, VFSR
models trained on this dataset cannot produce visually pleasing results. In
this paper, we develop an automatic and scalable
pipeline to collect a high-quality video face dataset (VFHQ), which contains
over 16,000 high-fidelity clips of diverse interview scenarios. To verify the
necessity of VFHQ, we further conduct experiments and demonstrate that VFSR
models trained on our VFHQ dataset can generate results with sharper edges and
finer textures than those trained on VoxCeleb1. In addition, we show that
temporal information plays a pivotal role in eliminating video consistency
issues as well as further improving visual performance. Based on VFHQ, we
present a benchmarking study of several state-of-the-art algorithms under
bicubic and blind settings. See our project page:
this https URL

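This paper develops an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ) and shows that video face super-resolution (VFSR) models trained on VFHQ produce sharper edges and finer textures than models trained on VoxCeleb1. It also shows that temporal information plays a key role in eliminating video consistency issues and further improving visual performance.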