Biological research has shown that verbal semantic information in the
cerebral cortex participates, as an additional source, in nonverbal semantic
tasks such as visual encoding. However, previous visual encoding models did
not incorporate verbal semantic information, which is at odds with this biological
finding. To address this issue, this paper proposes a multimodal visual
information encoding network model based on stimulus images and associated
textual information. The model takes stimulus images as input and uses text
generated for them by a text-image generation model as verbal semantic
information, which injects new information into the visual encoding model.
A Transformer network then
aligns image and text feature information, creating a multimodal feature space.
A convolutional network then maps from this multimodal feature space to voxel
space, constructing the multimodal visual information encoding network model.
Experimental results show that the proposed model outperforms previous models
at the same training cost. For subject 1, voxel prediction performance
improves by approximately 15.87% in the left hemisphere and by about 4.6% in
the right hemisphere. Additionally,
ablation experiments indicate that our proposed model better simulates the
brain's visual information processing.
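
The pipeline described above (image and text features aligned by a Transformer,
then mapped to voxel space by a convolutional network) can be sketched roughly
as follows. This is a minimal NumPy illustration under stated assumptions, not
the authors' implementation: a single cross-attention layer stands in for the
Transformer, and all dimensions, weights, and function names
(`cross_attention`, `conv1d_tokens`, `encode`) are hypothetical. The random
token arrays stand in for features that a real system would obtain from
pretrained image and text encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    # Image tokens query the text tokens (scaled dot-product attention).
    q = img_tokens @ w_q                      # (n_img, d)
    k = txt_tokens @ w_k                      # (n_txt, d)
    v = txt_tokens @ w_v                      # (n_txt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_img, n_txt)
    # Residual connection: text information is added to the image features.
    return img_tokens + attn @ v, attn


def conv1d_tokens(x, kernel):
    # Valid 1-D convolution along the token axis, followed by ReLU.
    # x: (n_tokens, d), kernel: (k, d, d_out)
    k = kernel.shape[0]
    n_out = x.shape[0] - k + 1
    out = np.stack(
        [np.einsum("kd,kdo->o", x[i:i + k], kernel) for i in range(n_out)]
    )
    return np.maximum(out, 0.0)


def encode(img_tokens, txt_tokens, params):
    # Fused (multimodal) features -> convolution -> pooled -> voxel responses.
    fused, attn = cross_attention(
        img_tokens, txt_tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    hidden = conv1d_tokens(fused, params["kernel"])
    pooled = hidden.mean(axis=0)              # (d_out,)
    return pooled @ params["w_out"] + params["b_out"], attn


# Hypothetical sizes, chosen only for illustration.
d, d_out, n_img, n_txt, n_voxels = 16, 8, 9, 5, 20
rng = np.random.default_rng(0)
params = {
    "w_q": rng.standard_normal((d, d)) * 0.1,
    "w_k": rng.standard_normal((d, d)) * 0.1,
    "w_v": rng.standard_normal((d, d)) * 0.1,
    "kernel": rng.standard_normal((3, d, d_out)) * 0.1,
    "w_out": rng.standard_normal((d_out, n_voxels)) * 0.1,
    "b_out": np.zeros(n_voxels),
}
img_tokens = rng.standard_normal((n_img, d))  # stand-in for image features
txt_tokens = rng.standard_normal((n_txt, d))  # stand-in for text features

pred, attn = encode(img_tokens, txt_tokens, params)
print(pred.shape, attn.shape)
```

In this sketch the weights are random, so the predicted voxel responses are
meaningless. In practice the parameters would be fitted so that predictions
match measured voxel responses to each stimulus image.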

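A multimodal visual information encoding network model based on stimulus images and associated textual information embeds verbal semantic information into the visual encoding model as new information and uses a Transformer network to align image and text feature information, constructing a multimodal feature space. Experimental results show that the model outperforms previous models, and ablation experiments show that our proposed model better simulates the brain's visual information processing.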