We propose a novel algorithm for visual question answering based on a recurrent deep neural network, in which every module corresponds to a complete answering unit with its own attention mechanism. The network is optimized by minimizing the loss aggregated over all units, which share model parameters but receive different information to compute attention probabilities. During training, our model attends to a region within the image feature map, updates its memory based on the question and the attended image feature, and answers the question based on its memory state. This procedure is repeated at every step to compute that step's loss. The approach is motivated by our observation that multi-step inference is often required to answer questions, while each problem may have its own desirable number of steps, which is difficult to identify in practice. Hence, we always let the first unit in the network answer the question, but allow it to learn from the remaining units through backpropagation unless doing so degrades the model. To implement this idea, we early-stop the training of each unit as soon as it starts to overfit. Note that, since more complex models tend to overfit quickly on easier questions, the last answering unit in the unfolded recurrent neural network is typically stopped first, while the first one remains active the longest. We make a single-step prediction for a new question using the shared model. This strategy works better than the other options within our framework because the selected model is trained effectively from all units without overfitting. The proposed algorithm achieves state-of-the-art performance on the standard benchmark dataset without data augmentation.
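The training scheme described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the additive attention query, the tanh memory update, the linear answer classifier, the rule "stop a unit when its validation loss rises", and all function names are assumptions chosen only to make the idea concrete. What it does take from the abstract is the structure: every step reuses the same parameters, each step yields its own answer and loss, the losses of the still-active units are summed, and prediction uses a single step.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def answering_unit(regions, question, memory, answer_weights):
    """One recurrent step: attend to regions, update memory, answer."""
    # Assumption: the attention query mixes the question with the current
    # memory, so each step sees different information with shared parameters.
    query = [q + m for q, m in zip(question, memory)]
    attention = softmax([dot(r, query) for r in regions])
    attended = [sum(a * r[d] for a, r in zip(attention, regions))
                for d in range(len(question))]
    # Assumption: a simple bounded memory update (hypothetical form).
    new_memory = [math.tanh(m + q + v)
                  for m, q, v in zip(memory, question, attended)]
    answer_probs = softmax([dot(w, new_memory) for w in answer_weights])
    return new_memory, attention, answer_probs

def joint_loss(regions, question, answer_weights, target, active):
    """Sum the cross-entropy of every active unit; also return each unit's loss."""
    memory = [0.0] * len(question)
    total, per_unit = 0.0, []
    for is_active in active:
        memory, _, probs = answering_unit(regions, question, memory, answer_weights)
        loss = -math.log(probs[target])
        per_unit.append(loss)
        if is_active:  # early-stopped units no longer contribute to the loss
            total += loss
    return total, per_unit

def drop_overfit_units(active, prev_val_loss, curr_val_loss):
    """Assumed criterion: stop a unit once its validation loss increases."""
    return [a and curr <= prev
            for a, prev, curr in zip(active, prev_val_loss, curr_val_loss)]

def predict(regions, question, answer_weights):
    """Single-step prediction with the shared parameters (first unit only)."""
    memory = [0.0] * len(question)
    _, _, probs = answering_unit(regions, question, memory, answer_weights)
    return max(range(len(probs)), key=probs.__getitem__)
```

In a real training loop the gradient of the value returned by `joint_loss` would update the shared parameters, and `drop_overfit_units` would be called after each validation pass, so that later units typically drop out first while the first unit keeps training.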

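This paper proposes a visual question answering algorithm based on a recurrent deep neural network, in which every module corresponds to a complete answering unit with its own attention mechanism. The network is optimized using the loss aggregated over all units, which share model parameters and receive different information to compute attention probabilities. Experimental results show that the algorithm outperforms other multi-step attention mechanisms on the VQA dataset.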

Training Recurrent Answering Units with Joint Loss Minimization for VQA