Leveraging the synergy of both audio data and visual data is essential for
understanding human emotions and behaviors, especially in in-the-wild settings.
Traditional methods for integrating such multimodal information often stumble,
leading to suboptimal results in facial action unit (AU)
detection. To overcome these shortcomings, we propose a novel approach
utilizing audio-visual multimodal data. This method enhances audio feature
extraction by leveraging Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel
spectrogram features alongside a pre-trained VGGish network. Moreover, the
method adaptively captures fusion features across modalities by modeling
temporal relationships, and utilizes a pre-trained GPT-2 model for
sophisticated context-aware fusion of multimodal information. Our method
notably improves the accuracy of AU detection by understanding the temporal and
contextual nuances of the data, showcasing significant advancements in the
comprehension of intricate scenarios. These findings underscore the potential
of integrating temporal dynamics and contextual interpretation, paving the way
for future research endeavors.
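
To make the temporal modeling step more concrete, the following is a minimal sketch of fusing per-frame audio and visual features with a 1D temporal convolution. It is not the paper's architecture: the function name `temporal_conv_fusion`, the feature shapes, and the explicit `kernel` argument are all assumptions made for illustration, and the GPT-2 fusion stage is not modeled here.

```python
import numpy as np

def temporal_conv_fusion(audio, visual, kernel):
    """Concatenate per-frame features and apply a 1D temporal convolution.

    audio:  (T, Da) per-frame audio features (hypothetical shape)
    visual: (T, Dv) per-frame visual features (hypothetical shape)
    kernel: (K, Da + Dv, Dout) convolution weights, K odd
    Returns an array of shape (T, Dout).
    """
    x = np.concatenate([audio, visual], axis=1)   # (T, Da + Dv)
    k = kernel.shape[0]
    pad = k // 2
    # Zero-pad along time so the output keeps T frames.
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]                      # (K, Da + Dv)
        # Sum over time offsets and input channels for each output channel.
        out[t] = np.einsum("kd,kdo->o", window, kernel)
    return out
```

Each output frame mixes both modalities over a window of K neighbouring frames, which is one simple way to let the fused feature depend on temporal context rather than on a single frame.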

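Exploiting the synergy between audio and visual data is important for understanding human emotions and behaviors. This paper proposes a new method that uses audio-visual multimodal data: by modeling temporal relationships and using a pre-trained GPT-2 model for context-aware fusion of multimodal information, it significantly improves the accuracy of facial action unit detection, marks important progress in understanding complex scenes, and paves the way for future research.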

AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Recovering the 3D representation of an object from single-view or multi-view
RGB images by deep neural networks has attracted increasing attention in the
past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural
networks (RNNs) to fuse multiple feature maps extracted from input images
sequentially. However, when given the same set of input images with different
orders, RNN-based approaches are unable to produce consistent reconstruction
results. Moreover, due to long-term memory loss, RNNs cannot fully exploit
input images to refine reconstruction results. To solve these problems, we
propose a novel framework for single-view and multi-view 3D reconstruction,
named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse
3D volume from each input image. Then, a context-aware fusion module is
introduced to adaptively select high-quality reconstructions for each part
(e.g., table legs) from different coarse 3D volumes to obtain a fused 3D
volume. Finally, a refiner further refines the fused 3D volume to generate the
final output. Experimental results on the ShapeNet and Pix3D benchmarks
indicate that the proposed Pix2Vox outperforms state-of-the-art methods by a large
margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in
terms of backward inference time. Experiments on unseen ShapeNet 3D
categories show the superior generalization ability of our method.
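
The fusion step described above can be illustrated with a small sketch. It assumes each view comes with a per-voxel score volume (in a trained model these would be predicted by a network; here they are simply passed in) and combines the coarse volumes with a softmax over views. The function name `fuse_volumes` and the array shapes are hypothetical, and this is an illustration under those assumptions rather than the authors' implementation.

```python
import numpy as np

def fuse_volumes(volumes, scores):
    """Fuse N coarse voxel volumes using per-voxel scores.

    volumes: (N, D, H, W) coarse occupancy volumes, one per view
    scores:  (N, D, H, W) per-voxel quality scores (assumed given)
    Returns a fused volume of shape (D, H, W).
    """
    # Softmax across the view axis, shifted for numerical stability.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Per-voxel weighted sum of the coarse volumes.
    return (weights * volumes).sum(axis=0)

rng = np.random.default_rng(0)
vols = rng.random((3, 4, 4, 4))
scrs = rng.normal(size=(3, 4, 4, 4))
fused = fuse_volumes(vols, scrs)

# Reordering the views (with their scores) leaves the result unchanged.
perm = [2, 0, 1]
same = np.allclose(fused, fuse_volumes(vols[perm], scrs[perm]))
```

Because the weighted sum runs over the view axis symmetrically, the fused volume does not depend on the order of the input views, in contrast to the order sensitivity of sequential RNN fusion noted in the abstract.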

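This work proposes a new framework named Pix2Vox. A well-designed encoder-decoder generates a coarse 3D volume from each input image; a context-aware fusion module then adaptively selects high-quality reconstructions of each part (e.g., table legs) from the different coarse 3D volumes to obtain a fused 3D volume; and a refiner further refines the fused 3D volume to produce the final output. Experimental results on 3D reconstruction show that Pix2Vox not only outperforms other existing algorithms but is also 24 times faster than 3D-R2N2 in backward inference time, and that the method generalizes well.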