In this paper, we present our solution for the SMART-101 Challenge of the CVPR
2024 Multi-modal Algorithmic Reasoning Task. Unlike traditional visual
question answering tasks, this challenge evaluates the abstraction, deduction, and
generalization abilities of neural networks in solving visuo-linguistic puzzles
designed specifically for children in the 6-8 age group. Our model is based on two
pre-trained models, dedicated to extracting features from text and images,
respectively. To integrate the features from the two modalities, we employed
a fusion layer with an attention mechanism. We explored different text and image
pre-trained models, and fine-tuned the integrated classifier on the SMART-101
dataset. Experimental results show that under the puzzle
split, our proposed integrated classifier achieves superior performance,
verifying the effectiveness of multi-modal pre-trained representations.
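
As a rough illustration of the pipeline described above (two encoders, an attention-based fusion layer, and a classifier), the following dependency-free sketch fuses one text feature vector and one image feature vector with softmax attention weights and scores a set of answer options. It is not the implementation used in this work: the function names, the single query vector, the feature dimension of 4, the five answer options, and all numeric values are hypothetical, and the pre-trained encoders are replaced by hard-coded feature vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(text_feat, image_feat, query):
    """Fuse two modality features with scaled dot-product attention.

    `query` stands in for a learned parameter (an assumption of this sketch).
    Returns the fused vector and the per-modality attention weights.
    """
    feats = [text_feat, image_feat]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in feats])
    fused = [
        sum(w * f[i] for w, f in zip(weights, feats))
        for i in range(len(query))
    ]
    return fused, weights


def classify(fused, class_weights):
    """Linear classifier head: one weight row per answer option."""
    logits = [dot(row, fused) for row in class_weights]
    return softmax(logits)


# Hypothetical stand-ins for the outputs of the text and image encoders.
text_feat = [0.2, -0.1, 0.4, 0.0]
image_feat = [0.5, 0.3, -0.2, 0.1]
query = [1.0, 0.0, 0.0, 0.0]

# Hypothetical classifier weights for five answer options.
class_weights = [
    [0.1, 0.0, 0.2, 0.0],
    [0.0, 0.3, 0.0, 0.1],
    [0.2, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.4],
    [0.3, 0.0, 0.0, 0.1],
]

fused, weights = attention_fuse(text_feat, image_feat, query)
probs = classify(fused, class_weights)
print("attention weights (text, image):", weights)
print("answer probabilities:", probs)
```

In a fine-tuning setup, the query vector and classifier weights would be trained parameters, and the feature vectors would come from the chosen pre-trained text and image models.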

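We propose a neural network solution based on multi-modal algorithmic reasoning for solving visuo-linguistic puzzles designed specifically for children aged 6-8. Our model builds on two pre-trained models that extract features from text and images, respectively, and integrates these features through a fusion layer with an attention mechanism. Experimental results show that under the puzzle split of the SMART-101 challenge dataset, our proposed integrated classifier achieves superior performance, verifying the effectiveness of multi-modal pre-trained representations.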