Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a
single RGB image. For both convolutional and recent attention-based models,
encoder-decoder architectures have proven useful because the task requires
global context and pixel-level resolution simultaneously. Typically, a skip
connection module fuses the encoder and decoder features by concatenating
feature maps and applying a convolution. Inspired by the demonstrated benefits of attention in a
multitude of computer vision problems, we propose an attention-based fusion of
encoder and decoder features. We pose MDE as a pixel query refinement problem,
where coarsest-level encoder features are used to initialize pixel-level
queries, which are then refined to higher resolutions by the proposed Skip
Attention Module (SAM). We formulate the prediction problem as ordinal
regression over the bin centers that discretize the continuous depth range and
introduce a Bin Center Predictor (BCP) module that predicts bins at the
coarsest level using pixel queries. Apart from the benefit of image-adaptive
depth binning, the proposed design helps learn an improved depth embedding in
the initial pixel queries via direct supervision from the ground truth. Extensive
experiments on the two canonical datasets, NYUV2 and KITTI, show that our
architecture outperforms the state of the art by 5.3% and 3.9%, respectively,
and improves generalization performance by 9.4% on the SUNRGBD
dataset. Code is available at this https URL
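
The bin-based formulation above can be illustrated with a minimal sketch. It assumes one common way of turning per-pixel bin scores into a depth value: normalize predicted bin widths into centers over the depth range, then take a probability-weighted sum of those centers. The function names, the depth range, and the weighted-sum readout are assumptions for illustration, not details stated in the abstract.

```python
import math

def bin_centers(widths, d_min, d_max):
    """Turn raw positive bin widths into centers spanning [d_min, d_max]."""
    total = sum(widths)
    centers, edge = [], d_min
    for w in widths:
        span = (w / total) * (d_max - d_min)  # this bin's share of the range
        centers.append(edge + span / 2.0)
        edge += span
    return centers

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixel_depth(logits, centers):
    """Depth of one pixel as the expected bin center under its bin probabilities."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))

# Hypothetical example: three bins over a 0 to 8 metre range.
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
depth = pixel_depth([0.0, 0.0, 0.0], centers)  # uniform scores over the bins
```

Because the bin widths are predicted per image, the centers adapt to each scene's depth distribution, while each pixel only has to score the bins.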

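By introducing the attention-based Skip Attention Module, the encoder and decoder features in Monocular Depth Estimation are fused more effectively; the task is posed as a pixel query refinement problem, with ordinal regression performed using the proposed Bin Center Predictor module. Extensive experiments on the NYUV2 and KITTI datasets show that the architecture outperforms existing methods and generalizes better on the SUNRGBD dataset.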

Monocular Depth Prediction with Skip Attention

Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention

Face manipulation methods can be misused to affect an individual's privacy or
to spread disinformation. To this end, we introduce a novel data-driven
approach that produces image-specific perturbations which are embedded in the
original images. The key idea is that these protected images prevent face
manipulation by causing the manipulation model to produce a predefined
manipulation target (uniformly colored output image in our case) instead of the
actual manipulation. In addition, we propose to leverage a differentiable
compression approximation, making the generated perturbations robust to
common image compression. To protect against multiple manipulation
methods simultaneously, we further propose a novel attention-based fusion of
manipulation-specific perturbations. Compared to traditional adversarial
attacks that optimize noise patterns for each image individually, our
generalized model only needs a single forward pass, thus running orders of
magnitude faster and allowing for easy integration in image processing stacks,
even on resource-constrained devices like smartphones.
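
For contrast, the per-image optimization that the abstract attributes to traditional adversarial attacks can be sketched as follows. This is a toy illustration, not the proposed single-forward-pass model: the "manipulation model" is a hypothetical linear layer, and the perturbation is found by projected gradient descent toward a uniform target under an assumed L-infinity budget `eps`.

```python
def apply_model(weights, image):
    """Toy stand-in for a manipulation model: a single linear layer."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]

def target_loss(weights, image, delta, target):
    """Squared error between the model output on the protected image and the target."""
    out = apply_model(weights, [x + d for x, d in zip(image, delta)])
    return sum((o - t) ** 2 for o, t in zip(out, target))

def optimize_perturbation(weights, image, target, eps=0.1, lr=0.05, steps=100):
    """Projected gradient descent on the perturbation for one image."""
    delta = [0.0] * len(image)
    for _ in range(steps):
        out = apply_model(weights, [x + d for x, d in zip(image, delta)])
        residual = [o - t for o, t in zip(out, target)]
        # Gradient of the squared error w.r.t. delta is 2 * W^T * residual.
        grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(len(delta))]
        # Gradient step, then clip so the perturbation stays within +/- eps.
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

# Hypothetical two-pixel "image" and a uniform target output.
weights = [[0.6, 0.4], [0.3, 0.7]]
image = [0.2, 0.4]
target = [1.0, 1.0]
eps = 0.1
delta = optimize_perturbation(weights, image, target, eps=eps)
loss_before = target_loss(weights, image, [0.0, 0.0], target)
loss_after = target_loss(weights, image, delta, target)
```

The loop has to be rerun for every image, which is the cost the abstract's learned model avoids by predicting the perturbation directly.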

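This paper proposes a new data-driven method that prevents face manipulation by embedding protective perturbations in the original images. The generated perturbations are robust to common image compression, and an attention-based fusion mechanism is introduced to improve the protection.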

TAFIM: Targeted Adversarial Attacks against Facial Image Tampering

TAFIM: Targeted Adversarial Attacks against Facial Image Manipulations

Accurate detection of obstacles in 3D is an essential task for autonomous
driving and intelligent transportation. In this work, we propose a general
multimodal fusion framework FusionPainting to fuse the 2D RGB image and 3D
point clouds at a semantic level for boosting the 3D object detection task.
Specifically, the FusionPainting framework consists of three main modules: a
multi-modal semantic segmentation module, an adaptive attention-based semantic
fusion module, and a 3D object detector. First, semantic information is
obtained for 2D images and 3D LiDAR point clouds using 2D and 3D
segmentation approaches. Then the segmentation results from the different sensors
are adaptively fused by the proposed attention-based semantic fusion
module. Finally, the point clouds painted with the fused semantic labels are
sent to the 3D detector to obtain the 3D detection results. The
effectiveness of the proposed framework has been verified on the large-scale
nuScenes detection benchmark by comparing it with three different baselines.
The experimental results show that the fusion strategy significantly
improves detection performance compared to methods using only point
clouds and methods using point clouds painted only with 2D segmentation
information. Furthermore, the proposed approach outperforms other
state-of-the-art methods on the nuScenes testing benchmark.
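
The fuse-then-paint step described above can be sketched for a single point. This is a minimal illustration under stated assumptions: each point already has class scores from a 2D and a 3D segmentation branch, and the two attention logits, which a real system would learn, are passed in as plain numbers. The function names are hypothetical.

```python
import math

def fuse_scores(scores_2d, scores_3d, logit_2d, logit_3d):
    """Blend two per-class score vectors with softmax attention weights."""
    m = max(logit_2d, logit_3d)
    e2, e3 = math.exp(logit_2d - m), math.exp(logit_3d - m)
    w2, w3 = e2 / (e2 + e3), e3 / (e2 + e3)  # weights sum to 1
    return [w2 * a + w3 * b for a, b in zip(scores_2d, scores_3d)]

def paint_point(xyz, fused_scores):
    """Append the fused semantic scores to the point's coordinates."""
    return list(xyz) + list(fused_scores)

# Hypothetical point with two classes; equal logits give an even blend.
fused = fuse_scores([0.8, 0.2], [0.4, 0.6], logit_2d=0.0, logit_3d=0.0)
painted = paint_point((1.0, 2.0, 0.5), fused)
```

The painted points can then be fed to any point-cloud detector that accepts extra per-point feature channels.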

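This work proposes a multimodal fusion framework called "FusionPainting" that fuses 2D RGB images and 3D point clouds at the semantic level to improve 3D obstacle detection, and it outperforms other existing methods on the nuScenes detection benchmark.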

FusionPainting: Adaptive Attention in Multimodal Fusion for 3D Object Detection

FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection

In this work, we explore a multimodal semi-supervised learning approach for
punctuation prediction by learning representations from large amounts of
unlabelled audio and text data. Conventional approaches in speech processing
typically use forced alignment to encode per-frame acoustic features into
word-level features and perform multimodal fusion of the resulting acoustic and
lexical representations. As an alternative, we explore attention-based
multimodal fusion and compare its performance with forced-alignment-based
fusion. Experiments conducted on the Fisher corpus show that our proposed
approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the
baseline BLSTM model on reference transcripts and ASR outputs, respectively. We
further improve the model's robustness to ASR errors by performing data
augmentation with N-best lists, which achieves up to an additional ~2-6%
improvement on ASR outputs. We also demonstrate the effectiveness of the
semi-supervised learning approach by performing an ablation study on various sizes
of the corpus. When trained on 1 hour of speech and text data, the proposed
model achieved ~9-18% absolute improvement over the baseline model.
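
Attention-based fusion of acoustic and lexical features, as opposed to forced alignment, can be sketched as follows. This is a minimal illustration under stated assumptions: each word vector acts as a query over all acoustic frames using scaled dot-product attention, word and frame vectors share the same dimension, and no learned projections are used. The function names are hypothetical.

```python
import math

def attend(word_vec, frame_vecs):
    """Scaled dot-product attention of one word over all acoustic frames."""
    scale = math.sqrt(len(word_vec))
    scores = [sum(q * k for q, k in zip(word_vec, f)) / scale for f in frame_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_vecs[0])
    # Acoustic context: attention-weighted average of the frame vectors.
    context = [sum(w * f[i] for w, f in zip(weights, frame_vecs)) for i in range(dim)]
    return weights, context

def fuse(word_vecs, frame_vecs):
    """Concatenate each word vector with its attended acoustic context."""
    return [list(w) + attend(w, frame_vecs)[1] for w in word_vecs]

# Hypothetical 2-dimensional features: two words, three frames.
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
frame_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(word_vecs[0], frame_vecs)
fused = fuse(word_vecs, frame_vecs)
```

No word-to-frame time boundaries are needed here: the attention weights decide which frames inform each word, which is what removes the dependence on a forced aligner.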

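This study explores a multimodal semi-supervised learning approach that predicts punctuation by learning from large amounts of unlabelled audio and text data. Experiments show that the proposed approach, which uses attention-based multimodal fusion as an alternative to forced-alignment-based fusion, achieves about 6-9% and 3-4% absolute F1 improvement over the baseline model on reference transcripts and ASR outputs, respectively, and that data augmentation makes the model more robust to ASR errors.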