Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation across modalities including images, videos, and 3D geometry, it does not scale to higher data resolutions. This can be attributed to the ``scaling property'', where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model that comprises a view-wise sampling algorithm to focus on local structure learning and incorporates additional guidance, e.g., text descriptions, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.

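The paper proposes a new model that combines a view-wise sampling algorithm with additional guidance, such as text descriptions, for learning detailed structure, so that the model can scale to high-resolution data and unify visual content generation across multiple modalities. Experimental results demonstrate the model's effectiveness and its potential as a foundation framework for scalable, modality-unified visual content generation.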

Scaling Diffusion Probabilistic Fields to High Resolution on Unified Visual Modalities