In line with the human capacity to perceive the world by simultaneously
processing and integrating high-dimensional inputs from multiple modalities
like vision and audio, we propose a novel model, MAiVAR-T (Multimodal
Audio-Image to Video Action Recognition Transformer). This model takes an
intuitive approach to combining the audio-image and video modalities,
with the primary aim of improving the effectiveness of multimodal human action
recognition (MHAR). At the core of MAiVAR-T is the distillation of
substantial representations from the audio modality and their transformation into
the image domain. Subsequently, this audio-image representation is fused with the
video modality to form a unified representation. This concerted approach
strives to exploit the contextual richness inherent in both audio and video
modalities, thereby improving action recognition. MAiVAR-T outperforms existing
state-of-the-art strategies that focus solely on the audio or video modality.
Extensive empirical evaluations on a benchmark action recognition dataset
corroborate this performance. These results underscore the potential gains
from integrating audio and video modalities for action recognition.
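
The pipeline described above (audio mapped into the image domain, then fused with video) can be illustrated with a minimal sketch. This is not the MAiVAR-T implementation: the abstract does not specify the audio-image representation or the fusion mechanism, so the log-magnitude spectrogram, the mean pooling, and the concatenation used here are assumptions chosen for simplicity, and the names `audio_to_image`, `mean_pool`, and `fuse` are hypothetical.

```python
import cmath
import math


def audio_to_image(waveform, frame_len=8, hop=4):
    """Turn a 1-D waveform into a 2-D log-magnitude spectrogram (rows = frames).

    Assumption: a spectrogram stands in for the audio-image representation.
    """
    image = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(math.log1p(abs(coeff)))
        image.append(row)
    return image


def mean_pool(grid):
    """Average a 2-D grid over its rows to get one feature vector."""
    n_rows = len(grid)
    return [sum(col) / n_rows for col in zip(*grid)]


def fuse(audio_image, video_frames):
    """Assumed fusion: concatenate pooled audio-image and video features."""
    return mean_pool(audio_image) + mean_pool(video_frames)


# Toy inputs: a 32-sample sine wave and three 4-dimensional "video frame" features.
waveform = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
video_frames = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5], [0.3, 0.4, 0.5, 0.6]]

audio_image = audio_to_image(waveform)
fused = fuse(audio_image, video_frames)
```

In a real system the pooled vectors would be replaced by learned embeddings, and the fused representation would feed a classifier over action labels; the sketch only shows the data flow from the two modalities into one vector.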

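This paper proposes a new model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer), which fuses audio and image modalities to improve multimodal human action recognition (MHAR) and demonstrates excellent performance on a benchmark action recognition dataset.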