This study explores innovative methods for improving Visual Question Answering (VQA) using Generative Adversarial Networks (GANs), autoencoders, and attention mechanisms. Using a balanced VQA dataset, we investigate three distinct strategies. First, GAN-based approaches aim to generate answer embeddings conditioned on image and question inputs; they show potential but struggle with more complex tasks. Second, autoencoder-based techniques focus on learning optimal embeddings for questions and images, achieving results comparable to the GAN-based approach owing to better performance on complex questions. Finally, attention mechanisms incorporating Multimodal Compact Bilinear pooling (MCB) address language priors and attention modeling, albeit with a trade-off between complexity and performance. The study underscores the challenges and opportunities in VQA and suggests avenues for future research, including alternative GAN formulations and attention mechanisms.
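For readers unfamiliar with MCB, the pooling step can be illustrated with a minimal NumPy sketch. It follows the commonly described formulation (a Count Sketch projection of each modality, then an element-wise product in the frequency domain) and is not taken from this study's implementation. The function names, feature sizes, and output dimension are assumptions chosen for illustration.

```python
import numpy as np


def make_sketch_params(in_dim, out_dim, rng):
    """Draw the fixed random hash indices and signs used by Count Sketch."""
    h = rng.integers(0, out_dim, size=in_dim)   # target bucket for each input index
    s = rng.choice([-1.0, 1.0], size=in_dim)    # random sign for each input index
    return h, s


def count_sketch(x, h, s, out_dim):
    """Project x into out_dim buckets: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(out_dim)
    np.add.at(y, h, s * x)  # unbuffered add, so colliding indices accumulate
    return y


def mcb_pool(img_feat, q_feat, params_img, params_q, out_dim):
    """Approximate the outer product of two feature vectors in out_dim dims.

    Circular convolution of the two sketches equals the Count Sketch of the
    outer product, and is computed here as a product in the Fourier domain.
    """
    sk_img = count_sketch(img_feat, *params_img, out_dim)
    sk_q = count_sketch(q_feat, *params_q, out_dim)
    spectrum = np.fft.rfft(sk_img) * np.fft.rfft(sk_q)
    return np.fft.irfft(spectrum, n=out_dim)


# Illustrative sizes only (hypothetical, not the study's settings).
rng = np.random.default_rng(0)
img_dim, q_dim, out_dim = 6, 5, 16
params_img = make_sketch_params(img_dim, out_dim, rng)
params_q = make_sketch_params(q_dim, out_dim, rng)
img_feat = rng.normal(size=img_dim)
q_feat = rng.normal(size=q_dim)
pooled = mcb_pool(img_feat, q_feat, params_img, params_q, out_dim)
```

The hash indices and signs are drawn once and then held fixed, so the projection is a deterministic function of the inputs. The output dimension controls the complexity-performance trade-off mentioned above: a larger value approximates the full bilinear interaction more closely at higher cost.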


Exploring Multiple Approaches to Visual Question Answering