Popularized as 'bottom-up' attention, bounding box (or region) based visual
features have recently surpassed vanilla grid-based convolutional features as
the de facto standard for vision and language tasks like visual question
answering (VQA). However, it is not clear whether the advantages of regions
(e.g. better localization) are the key reasons for the success of bottom-up
attention. In this paper, we revisit grid features for VQA, and find they can
work surprisingly well - running more than an order of magnitude faster with
the same accuracy (e.g. if pre-trained in a similar fashion). Through extensive
experiments, we verify that this observation holds true across different VQA
models (reporting a state-of-the-art accuracy on VQA 2.0 test-std, 72.71),
datasets, and generalizes well to other tasks like image captioning. As grid
features make the model design and training process much simpler, this enables
us to train them end-to-end and also use a more flexible network design. We
learn VQA models end-to-end, from pixels directly to answers, and show that
strong performance is achievable without using any region annotations in
pre-training. We hope our findings help further improve the scientific
understanding and the practical application of VQA. Code and features will be
made available.

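This paper examines whether bounding box (region) based bottom-up attention is the key factor behind success on vision and language tasks such as visual question answering (VQA), and finds that the advantages of regions over grid features are not what matters most. Grid features also make model design and training simpler and more flexible, allow end-to-end training without region annotations, and enable learning directly from pixels to answers.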
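
As a rough illustration of what "grid features" means here, the sketch below flattens a convolutional feature map into one feature vector per spatial cell, so a downstream attention module can consume it the same way it would consume a list of per-region vectors. The function name `grid_features`, the toy sizes, and the nested-list layout are assumptions made for this example, not the paper's code.

```python
from typing import List

Vector = List[float]


def grid_features(feature_map: List[List[Vector]]) -> List[Vector]:
    """Flatten an H x W grid of C-dimensional vectors into H*W vectors.

    Cells are emitted in row-major order. The result has the same
    "list of feature vectors" shape as region features, except that the
    vectors come from fixed grid cells rather than detected boxes.
    """
    return [cell for row in feature_map for cell in row]


# Toy 2 x 3 feature map with 2 channels; each cell stores (row, col)
# so the flattening order is easy to inspect.
feature_map = [[[float(r), float(c)] for c in range(3)] for r in range(2)]
feats = grid_features(feature_map)
```

In this layout the number of vectors depends only on the feature map size, whereas a region-based pipeline first needs a detector to propose boxes.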

In Defense of Grid Features for Visual Question Answering

Existing attention mechanisms either attend to local image grid or object
level features for Visual Question Answering (VQA). Motivated by the
observation that questions can relate to both object instances and their parts,
we propose a novel attention mechanism that jointly considers reciprocal
relationships between the two levels of visual details. The bottom-up attention
thus generated is further coalesced with the top-down information to only focus
on the scene elements that are most relevant to a given question. Our design
hierarchically fuses multi-modal information i.e., language, object- and
grid-level features, through an efficient tensor decomposition scheme. The
proposed model improves the state-of-the-art single model performances from
67.9% to 68.2% on VQAv1 and from 65.7% to 67.4% on VQAv2, demonstrating a
significant boost.

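This work proposes a new attention mechanism that jointly considers two levels of visual detail, object instances and their parts. Using an efficient tensor decomposition scheme, the model hierarchically fuses multimodal information and significantly improves on existing models.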

Reciprocal Attention Fusion for Visual Question Answering

This paper presents a state-of-the-art model for visual question answering
(VQA), which won the first place in the 2017 VQA Challenge. VQA is a task of
significant importance for research in artificial intelligence, given its
multimodal nature, clear evaluation protocol, and potential real-world
applications. The performance of deep neural networks for VQA is very dependent
on choices of architectures and hyperparameters. To help further research in
the area, we describe in detail our high-performing, though relatively simple
model. Through a massive exploration of architectures and hyperparameters
representing more than 3,000 GPU-hours, we identified tips and tricks that lead
to its success, namely: sigmoid outputs, soft training targets, image features
from bottom-up attention, gated tanh activations, output embeddings initialized
using GloVe and Google Images, large mini-batches, and smart shuffling of
training data. We provide a detailed analysis of their impact on performance to
assist others in making an appropriate selection.

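This paper presents a state-of-the-art model for visual question answering (VQA) that won first place in the 2017 VQA Challenge. Through an exploration of architectures and hyperparameters representing more than 3,000 GPU-hours, the authors identify tips and tricks that improve performance, and analyze their impact in detail to help others make an appropriate selection.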
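
Two of the tricks named in the abstract, gated tanh activations and sigmoid outputs trained against soft targets, can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions: the helper names (`gated_tanh`, `soft_target_bce`, `matvec`), the weight layout, and the `eps` constant are hypothetical and chosen for illustration; the abstract itself only names the techniques.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def gated_tanh(x: Vector, W: Matrix, b: Vector, Wg: Matrix, bg: Vector) -> Vector:
    """Gated tanh layer: tanh(W x + b) scaled elementwise by sigmoid(Wg x + bg)."""
    candidate = [math.tanh(v) for v in matvec(W, x, b)]
    gate = [sigmoid(v) for v in matvec(Wg, x, bg)]
    return [c * g for c, g in zip(candidate, gate)]


def soft_target_bce(logits: Vector, targets: Vector, eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over candidate answers.

    Each answer gets its own sigmoid output, and each target is a soft
    score in [0, 1] rather than a one-hot label.
    """
    loss = 0.0
    for z, s in zip(logits, targets):
        p = sigmoid(z)
        loss -= s * math.log(p + eps) + (1.0 - s) * math.log(1.0 - p + eps)
    return loss


# Toy usage: identity candidate weights, zero gate weights (gate = 0.5).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
hidden = gated_tanh([1.0, -1.0], identity, [0.0, 0.0], zeros, [0.0, 0.0])
loss = soft_target_bce([0.0], [0.5])
```

Because every answer has an independent sigmoid output, several answers can receive high scores for the same question, which matches targets that are soft scores rather than a single correct class.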

Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of
117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of
the method, applying the same approach to VQA we obtain first place in the 2017
VQA Challenge.

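This work proposes a combined bottom-up and top-down visual attention mechanism that computes attention weights at the level of objects and other salient image regions, enabling deeper image understanding. Applied to image captioning and visual question answering, it outperforms prior methods.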
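
The division of labor described in the abstract, where a bottom-up stage supplies one feature vector per region and a top-down stage decides how to weight them, can be sketched as follows. This is a simplified illustration, not the authors' model: it assumes a plain dot-product score between each region vector and a query vector, and the names `attend`, `regions`, and `query` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions: List[Vector], query: Vector) -> Tuple[Vector, Vector]:
    """Weight region features by their relevance to a query.

    regions: one feature vector per proposed image region (bottom-up).
    query:   a task-dependent context vector, e.g. an encoded question
             (top-down). Assumed to have the same dimension as a region.
    Returns the attention weights and the weighted sum of region vectors.
    """
    scores = [sum(v * q for v, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, regions))
              for d in range(dim)]
    return weights, pooled


# Toy usage: two regions, and a query that strongly prefers the first one.
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = attend(regions, [10.0, 0.0])
uniform_weights, uniform_pooled = attend(regions, [0.0, 0.0])
```

With a zero query every region gets the same weight, so the pooled vector is just the mean of the region features; a query aligned with one region shifts nearly all the weight onto it.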