We propose the ViNet architecture for audio-visual saliency prediction. ViNet
is a fully convolutional encoder-decoder architecture. The encoder uses visual
features from a network trained for action recognition, and the decoder infers
a saliency map via trilinear interpolation and 3D convolutions, combining
features from multiple hierarchies. The overall architecture of ViNet is
conceptually simple; it is causal and runs in real time (60 fps). ViNet does
not use audio as input and still outperforms the state-of-the-art audio-visual
saliency prediction models on nine different datasets (three visual-only and
six audio-visual datasets). ViNet also surpasses human performance on the CC,
SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first
network to do so. We also explore a variant of the ViNet architecture that
feeds audio features into the decoder. To our surprise, upon sufficient
training, the network becomes agnostic to the input audio and provides the same
output irrespective of the input. Interestingly, we also observe similar
behaviour in the previous state-of-the-art models \cite{tsiami2020stavis} for
audio-visual saliency prediction. Our findings contrast with previous works on
deep learning-based audio-visual saliency prediction, suggesting a clear avenue
for future explorations incorporating audio in a more effective manner. The
code and pre-trained models are available at
this https URL
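
Two of the metrics named above, CC and SIM, have standard definitions that are short enough to write out. The sketch below is a minimal pure-Python illustration, not the evaluation code used for ViNet: CC is the Pearson correlation between the predicted and ground-truth saliency maps, and SIM is the histogram intersection of the two maps after each is normalized to sum to 1. The function names and the list-of-lists map representation are assumptions made for this example; AUC is omitted because it depends on fixation sampling choices not described here.

```python
import math


def flatten(saliency_map):
    """Flatten a 2D list of non-negative floats into a 1D list."""
    return [value for row in saliency_map for value in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p, g = flatten(pred), flatten(gt)
    mean_p, mean_g = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    denom = math.sqrt(var_p * var_g)
    # A constant map has zero variance; report 0 instead of dividing by zero.
    return cov / denom if denom > 0 else 0.0


def sim(pred, gt):
    """Histogram intersection of two maps, each normalized to sum to 1."""
    p, g = flatten(pred), flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    if sum_p == 0 or sum_g == 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    prediction = [[0.1, 0.4], [0.3, 0.2]]
    ground_truth = [[0.0, 0.5], [0.4, 0.1]]
    print("CC :", cc(prediction, ground_truth))
    print("SIM:", sim(prediction, ground_truth))
```

Both scores are higher when the prediction is closer to the ground truth: CC lies in [-1, 1] and SIM in [0, 1], with 1 indicating identical maps (up to scale for SIM, up to an affine transform for CC).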

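This paper proposes ViNet, an architecture for audio-visual saliency prediction. It uses a fully convolutional encoder-decoder design: the encoder takes visual features from an action recognition network, and the decoder generates the saliency map via trilinear interpolation and 3D convolutions. Although it does not use audio as input, it still outperforms existing audio-visual saliency prediction models on nine different datasets and even surpasses human performance on some metrics. The paper also explores a variant that incorporates audio features into the decoder and draws some interesting conclusions.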

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

3D semantic scene labeling is fundamental to agents operating in the real
world. In particular, labeling raw 3D point sets from sensors provides
fine-grained semantics. Recent works leverage the capabilities of Neural
Networks (NNs), but are limited to coarse voxel predictions and do not
explicitly enforce global consistency. We present SEGCloud, an end-to-end
framework to obtain 3D point-level segmentation that combines the advantages of
NNs, trilinear interpolation (TI) and fully connected Conditional Random Fields
(FC-CRF). Coarse voxel predictions from a 3D Fully Convolutional NN are
transferred back to the raw 3D points via trilinear interpolation. Then the
FC-CRF enforces global consistency and provides fine-grained semantics on the
points. We implement the latter as a differentiable Recurrent NN to allow joint
optimization. We evaluate the framework on two indoor and two outdoor 3D
datasets (NYU V2, S3DIS, KITTI, Semantic3D.net), and show performance
comparable or superior to the state-of-the-art on all datasets.
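
The voxel-to-point transfer step described above can be illustrated with a generic trilinear interpolation routine. The following is a minimal sketch under stated assumptions, not SEGCloud's implementation: it assumes class scores live at integer voxel-centre coordinates, that points are already expressed in voxel-index units, and that points outside the grid are clamped to its boundary. The names `trilinear_sample` and `label_points` are hypothetical, and the FC-CRF refinement stage is left out entirely.

```python
import math


def trilinear_sample(grid, point):
    """Interpolate per-class scores at a continuous (x, y, z) location.

    grid[i][j][k] is a list of class scores at voxel centre (i, j, k).
    The point is clamped to the grid bounds (an assumption of this sketch).
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip(point, dims):
        clamped = min(max(coord, 0.0), n - 1.0)
        # Lower corner of the enclosing cell, kept inside the grid.
        lower = min(int(math.floor(clamped)), max(n - 2, 0))
        base.append(lower)
        frac.append(clamped - lower)

    num_classes = len(grid[0][0][0])
    scores = [0.0] * num_classes
    # Accumulate the weighted contribution of the 8 surrounding voxels.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = (
                    (frac[0] if di else 1.0 - frac[0])
                    * (frac[1] if dj else 1.0 - frac[1])
                    * (frac[2] if dk else 1.0 - frac[2])
                )
                if weight == 0.0:
                    continue
                i = min(base[0] + di, dims[0] - 1)
                j = min(base[1] + dj, dims[1] - 1)
                k = min(base[2] + dk, dims[2] - 1)
                for c in range(num_classes):
                    scores[c] += weight * grid[i][j][k][c]
    return scores


def label_points(grid, points):
    """Assign each point the class with the highest interpolated score."""
    labels = []
    for point in points:
        scores = trilinear_sample(grid, point)
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels
```

Because the interpolation weights vary smoothly with position, two points inside the same coarse voxel can receive different scores, which is what lets a coarse voxel grid produce point-level predictions.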

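This work introduces SEGCloud, an end-to-end framework for 3D point-level segmentation that combines trilinear interpolation with a fully connected conditional random field, enabling accurate scene labeling on both indoor and outdoor 3D datasets.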