Image-guided story ending generation (IgSEG) aims to generate a story ending
from the given story plots and an ending image. Existing methods focus on
cross-modal feature fusion but overlook reasoning over, and mining implicit
information from, the story plots and ending image. To tackle this drawback, we
propose a multimodal event transformer, an event-based reasoning framework for
IgSEG. Specifically, we construct visual and semantic event graphs from the
story plots and ending image, and apply event-based reasoning to mine implicit
information within each modality. Next, we connect the visual and semantic
event graphs and use cross-modal fusion to integrate features from the two
modalities. In addition, we propose a multimodal injector that adaptively
passes essential information to the decoder. We also introduce incoherence
detection to improve the model's understanding of story-plot context and the
robustness of its graph modeling. Experimental results show that our method
achieves state-of-the-art performance on image-guided story ending
generation.
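
The step of connecting the two event graphs and fusing across modalities can
be illustrated with a minimal sketch. The abstract does not say how the graphs
are linked or which fusion operator is used, so everything here is an
assumption: the node names, the cross-modal edges, and the use of plain mean
aggregation as a stand-in for the fusion step.

```python
def connect_graphs(visual_edges, semantic_edges, cross_edges):
    """Merge two edge lists plus cross-modal links into one undirected graph."""
    adj = {}
    for u, v in list(visual_edges) + list(semantic_edges) + list(cross_edges):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mean_aggregate(adj, feats):
    """One round of message passing: average a node with its neighbors."""
    out = {}
    for node, nbrs in adj.items():
        group = [feats[node]] + [feats[n] for n in sorted(nbrs)]
        dim = len(feats[node])
        out[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return out


# Hypothetical toy graphs: "v:" marks visual nodes, "s:" marks semantic nodes.
visual_edges = [("v:dog", "v:ball")]
semantic_edges = [("s:play", "s:park")]
cross_edges = [("v:dog", "s:play")]  # assumed cross-modal link
feats = {
    "v:dog": [1.0, 0.0],
    "v:ball": [1.0, 0.0],
    "s:play": [0.0, 1.0],
    "s:park": [0.0, 1.0],
}

fused = mean_aggregate(connect_graphs(visual_edges, semantic_edges, cross_edges), feats)
print(fused["v:ball"])  # [1.0, 0.0]: no cross-modal neighbor, so unchanged
print(fused["v:dog"])   # now carries some of the semantic feature dimension
```

Only nodes touched by a cross-modal edge pick up features from the other
modality in a single round; further rounds would spread that information
through the rest of the joint graph.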

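Proposes an image-guided story ending generation method based on a multimodal event transformer. The method uses event graphs, cross-modal fusion, and event-based reasoning to derive implicit information from the story plots and ending image, and adaptively injects essential information into the decoder. Experiments show that it outperforms existing methods on story ending generation.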

Multimodal Event Transformer for Image-guided Story Ending Generation

Image-guided depth completion aims to generate dense depth maps from sparse
depth measurements and the corresponding RGB images. Currently, spatial
propagation networks (SPNs) are the most popular affinity-based methods for
depth completion, but they still suffer from the limited representational power
of a fixed affinity and from over-smoothing during iterations. Our solution is
to estimate independent affinity matrices in each SPN iteration, but doing so
directly is over-parameterized and computationally heavy. This paper therefore
introduces an efficient model that learns the affinity among neighboring pixels
with an attention-based, dynamic approach. Specifically, the proposed Dynamic
Spatial Propagation Network (DySPN) uses a non-linear propagation model (NLPM).
It decouples the neighborhood into parts according to their distances and
recursively generates independent attention maps to refine these parts into
adaptive affinity matrices. Furthermore, we adopt a diffusion suppression (DS)
operation so that the model converges at an early stage, preventing
over-smoothing of the dense depth. Finally, to decrease the computational cost,
we introduce three variations that reduce the number of neighbors and attention
maps needed while retaining similar accuracy. In practice, our method requires
fewer iterations to match the performance of other SPNs and yields better
results overall. DySPN outperforms other state-of-the-art (SoTA) methods on the
KITTI Depth Completion (DC) evaluation at the time of submission and also
yields SoTA performance on the NYU Depth v2 dataset.
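
To make the propagation idea concrete, here is a minimal sketch of a generic
SPN-style refinement loop in plain Python. It is not DySPN's NLPM: the
4-neighborhood, the `affinity` callback, and the hard re-injection of measured
depths are assumptions chosen for illustration. Because the callback receives
the iteration index, it can return different weights at each step, which is the
"independent affinity per iteration" idea in its simplest form.

```python
def propagate(depth, sparse, valid, affinity, steps=1):
    """Refine a dense depth grid by repeated weighted neighbor averaging.

    depth:    initial dense estimate (list of rows)
    sparse:   sparse depth measurements, same shape
    valid:    1 where a measurement exists, else 0
    affinity: hypothetical callback (r, c, dr, dc, step) -> non-negative weight
    """
    h, w = len(depth), len(depth[0])
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for step in range(steps):
        new = [row[:] for row in depth]
        for r in range(h):
            for c in range(w):
                if valid[r][c]:
                    new[r][c] = sparse[r][c]  # keep the measured depth
                    continue
                # Weights and depths of the in-bounds neighbors.
                pairs = [(affinity(r, c, dr, dc, step), depth[r + dr][c + dc])
                         for dr, dc in offsets
                         if 0 <= r + dr < h and 0 <= c + dc < w]
                self_w = affinity(r, c, 0, 0, step)
                total = self_w + sum(wt for wt, _ in pairs)
                mixed = self_w * depth[r][c] + sum(wt * d for wt, d in pairs)
                new[r][c] = mixed / total  # normalized weighted average
        depth = new
    return depth


def uniform(r, c, dr, dc, step):
    """Fixed affinity: every neighbor and the pixel itself weigh the same."""
    return 1.0


init = [[1.0] * 3 for _ in range(3)]
sparse = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
valid = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

out = propagate(init, sparse, valid, uniform, steps=2)
print(out[1][1])  # 5.0, the measurement is preserved
print(out[0][1])  # 2.0, pulled toward the measured neighbor
print(out[0][0])  # 1.0, not yet reached after two steps
```

With a fixed affinity such as `uniform`, running many more steps keeps blending
values toward their neighbors, which is the over-smoothing behavior the
abstract refers to.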

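This paper proposes DySPN, a dynamic spatial propagation network that learns the affinity among pixels through an attention mechanism to generate dense depth maps for RGB images, and adopts diffusion suppression to prevent over-smoothing. Experimental results show that DySPN outperforms other SoTA methods on the KITTI Depth Completion and NYU Depth v2 datasets.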

Dynamic Spatial Propagation Network for Depth Completion

We propose DeepHuman, an image-guided volume-to-volume translation CNN for 3D
human reconstruction from a single RGB image. To reduce the ambiguities
associated with surface geometry reconstruction, including the
reconstruction of invisible areas, we propose and leverage a dense semantic
representation generated from the SMPL model as an additional input. One key
feature of our network is that it fuses different scales of image features into
the 3D space through volumetric feature transformation, which helps to recover
accurate surface geometry. The visible surface details are further refined
through a normal refinement network, which can be concatenated with the volume
generation network using our proposed volumetric normal projection layer. We
also contribute THuman, a real-world 3D human model dataset containing about
7000 models. The network is trained on data generated from this
dataset. Overall, owing to the specific design of our network and the diversity
of our dataset, our method enables 3D human model estimation from only a
single image and outperforms state-of-the-art approaches.
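
The abstract does not spell out the volumetric feature transformation, so the
following is only a sketch of one way 2D image information could be fused into
a 3D volume. It assumes the image branch yields a per-pixel scale map and shift
map (hypothetical names `scale` and `shift`) that are broadcast along the depth
axis to modulate every slice of the volume.

```python
def modulate_volume(volume, scale, shift):
    """Apply a per-pixel (scale, shift) to every depth slice of a volume.

    volume:       nested list indexed [z][y][x]
    scale, shift: 2D maps indexed [y][x], assumed to come from image features
    """
    return [[[scale[y][x] * v + shift[y][x]
              for x, v in enumerate(row)]
             for y, row in enumerate(slab)]
            for slab in volume]


# Toy volume: 2 depth slices, 1 row, 2 columns.
volume = [[[1.0, 2.0]], [[3.0, 4.0]]]
scale = [[2.0, 0.5]]
shift = [[0.0, 1.0]]

print(modulate_volume(volume, scale, shift))  # [[[2.0, 2.0]], [[6.0, 3.0]]]
```

Repeating the same modulation with maps computed at several image resolutions
would be one way to read "different scales of image features", but that detail
is likewise an assumption.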

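Proposes an image-guided volume-to-volume translation CNN for 3D human reconstruction from a single RGB image. A dense semantic representation generated from the SMPL model reduces the ambiguities of surface geometry reconstruction, image features at different scales are fused into 3D space, and a normal refinement network, connected to the volume generation network through the proposed volumetric normal projection layer, further refines visible surface details. The network is trained on the contributed real-world 3D human model dataset, and experiments show that the method performs favorably against state-of-the-art approaches.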