3D scene segmentation based on neural implicit representation has emerged
recently with the advantage of training only on 2D supervision. However,
existing approaches still requires expensive per-scene optimization that
prohibits generalization to novel scenes during inference. To circumvent this
problem, we introduce a generalizable 3D segmentation framework based on
implicit representation. Specifically, our framework takes in multi-view image
features and semantic maps as the inputs instead of only spatial information to
avoid overfitting to scene-specific geometric and semantic information. We
propose a novel soft voting mechanism to aggregate the 2D semantic information
from different views for each 3D point. In addition to the image features, view
difference information is also encoded in our framework to predict the voting
scores. Intuitively, this allows the semantic information from nearby views to
contribute more compared to distant ones. Furthermore, a visibility module is
also designed to detect and filter out detrimental information from occluded
views. Due to the generalizability of our proposed method, we can synthesize
semantic maps or conduct 3D semantic segmentation for novel scenes with solely
2D semantic supervision. Experimental results show that our approach achieves
comparable performance with scene-specific approaches. More importantly, our
approach can even outperform existing strong supervision-based approaches with
only 2D annotations. Our source code is available at:
this https URL

基于神经隐式表示的 3D 场景分割方法，通过多视图图像特征和语义地图作为输入，采用软投票机制来聚合来自不同视图的二维语义信息，结合视角差异信息预测投票分数，通过可见性模块筛选掉遮挡视图的有害信息，在只有二维语义监督的情况下，能够综合合成语义地图或进行新场景的三维语义分割。

GNeSF：泛化的神经语义场

GNeSF: Generalizable Neural Semantic Fields

Different from Visual Question Answering task that requires to answer only
one question about an image, Visual Dialogue involves multiple questions which
cover a broad range of visual content that could be related to any objects,
relationships or semantics. The key challenge in Visual Dialogue task is thus
to learn a more comprehensive and semantic-rich image representation which may
have adaptive attentions on the image for variant questions. In this research,
we propose a novel model to depict an image from both visual and semantic
perspectives. Specifically, the visual view helps capture the appearance-level
information, including objects and their relationships, while the semantic view
enables the agent to understand high-level visual semantics from the whole
image to the local regions. Futhermore, on top of such multi-view image
features, we propose a feature selection framework which is able to adaptively
capture question-relevant information hierarchically in fine-grained level. The
proposed method achieved state-of-the-art results on benchmark Visual Dialogue
datasets. More importantly, we can tell which modality (visual or semantic) has
more contribution in answering the current question by visualizing the gate
values. It gives us insights in understanding of human cognition in Visual
Dialogue.

该研究提出了一种新的模型来从视觉和语义两个角度描述图像，在多角度图像特征的基础上提出了特征选择框架，逐层适应性地捕捉问题相关信息，并在基准视觉对话数据集上取得了最先进的结果。更重要的是，通过可视化门控值，我们能够确定视觉和语义哪个模式在回答当前问题中发挥更重要的作用，为我们理解人类认知在视觉对话中的作用提供了见解。