This paper explores a novel dynamic network for vision and language tasks,
in which the inference structure is customized on the fly for each input.
Most previous state-of-the-art approaches are static and hand-crafted networks,
which not only heavily rely on expert knowledge, but also ignore the semantic
diversity of input samples, therefore resulting in suboptimal performance. To
address these issues, we propose a novel Dynamic Transformer Network (DTNet)
for image captioning, which dynamically assigns customized paths to different
samples, leading to discriminative yet accurate captions. Specifically, to
build a rich routing space and improve routing efficiency, we introduce five
types of basic cells and group them into two separate routing spaces according
to their operating domains, i.e., spatial and channel. Then, we design a
Spatial-Channel Joint Router (SCJR), which endows the model with the capability
of path customization based on both spatial and channel information of the
input sample. To validate the effectiveness of our proposed DTNet, we conduct
extensive experiments on the MS-COCO dataset and achieve new state-of-the-art
performance on both the Karpathy split and the online test server.
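
The idea of assigning input-dependent paths over a set of candidate cells can be illustrated with a minimal soft-routing sketch. This is not the DTNet implementation: the names `route`, `softmax`, `identity_cell`, and `doubling_cell`, and the linear router over spatially pooled features, are assumptions made for illustration. The actual SCJR uses both spatial and channel information, whereas this toy version pools over the spatial axis only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(x, cells, router_weights):
    """Mix the outputs of candidate cells with input-dependent weights.

    x: list of tokens, each a list of channel values.
    cells: callables mapping x to an output of the same shape (hypothetical).
    router_weights: one weight vector (length = channels) per cell.
    """
    channels = len(x[0])
    # Summarize the input by averaging over the spatial (token) axis.
    pooled = [sum(tok[c] for tok in x) / len(x) for c in range(channels)]
    # One routing logit per cell, computed from the pooled summary.
    logits = [sum(w * p for w, p in zip(wv, pooled)) for wv in router_weights]
    probs = softmax(logits)
    outputs = [cell(x) for cell in cells]
    # Weighted sum of the cell outputs, element by element.
    mixed = [
        [sum(p * out[t][c] for p, out in zip(probs, outputs))
         for c in range(channels)]
        for t in range(len(x))
    ]
    return mixed, probs


def identity_cell(x):
    return [list(tok) for tok in x]


def doubling_cell(x):
    return [[2.0 * v for v in tok] for tok in x]


x = [[1.0, 2.0], [3.0, 4.0]]
mixed, probs = route(x, [identity_cell, doubling_cell],
                     [[0.0, 0.0], [0.0, 0.0]])
```

With all-zero router weights the two cells receive equal probability; a trained router would instead produce different path weights for different inputs.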

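This paper explores a novel dynamic network for vision and language tasks, in which the inference structure is customized dynamically for different inputs. Basic cells are introduced and grouped by operating domain (spatial and channel) to build a rich routing space and improve routing efficiency. A Spatial-Channel Joint Router then customizes paths according to the spatial and channel information of each input sample. Experiments on the MS-COCO dataset demonstrate the effectiveness of the proposed Dynamic Transformer Network, which achieves state-of-the-art performance on both the Karpathy split and the online test server.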

Image Captioning via Dynamic Path Customization

Current multimodal models, aimed at solving Vision and Language (V+L) tasks,
predominantly repurpose Vision Encoders (VE) as feature extractors. While many
VEs -- of different architectures, trained on different data and objectives --
are publicly available, they are not designed for the downstream V+L tasks.
Nonetheless, most current work assumes that a \textit{single} pre-trained VE
can serve as a general-purpose encoder. In this work, we evaluate whether the
information stored within different VEs is complementary, i.e., whether providing
the model with features from multiple VEs can improve performance on a target
task. We exhaustively experiment with three popular VEs on six downstream V+L
tasks and analyze the attention and VE-dropout patterns. Our results and
analyses suggest that diverse VEs complement each other, resulting in improved
downstream V+L task performance, where the improvements are not due to simple
ensemble effects (i.e. the performance does not always improve when increasing
the number of encoders). We demonstrate that future VEs, which are not
\textit{repurposed}, but explicitly \textit{designed} for V+L tasks, have the
potential of improving performance on the target V+L tasks.
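
One simple way to give a model features from several encoders, and to probe how much it relies on each, can be sketched as follows. This is an illustrative assumption rather than the paper's method: the function `combine_encoders`, the token-level concatenation, and the per-encoder drop probability are hypothetical, and the features are assumed to be already projected to a common dimension.

```python
import random


def combine_encoders(features_by_encoder, drop_prob=0.0, rng=None):
    """Concatenate token features from several encoders along the token axis.

    features_by_encoder: dict mapping an encoder name to a list of tokens.
    drop_prob: probability of dropping all features of an encoder
               (a toy stand-in for encoder-level dropout).
    Returns the combined token list and the names of the encoders kept.
    """
    if rng is None:
        rng = random.Random(0)
    names = list(features_by_encoder)
    kept = [n for n in names if rng.random() >= drop_prob]
    if not kept:
        # Always keep at least one encoder so the model sees some input.
        kept = [rng.choice(names)]
    combined = []
    for name in kept:
        combined.extend(features_by_encoder[name])
    return combined, kept


# Hypothetical encoder names and toy two-dimensional token features.
features = {
    "encoder_a": [[0.1, 0.2], [0.3, 0.4]],
    "encoder_b": [[0.5, 0.6]],
    "encoder_c": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}
all_tokens, kept_all = combine_encoders(features, drop_prob=0.0)
one_tokens, kept_one = combine_encoders(features, drop_prob=1.0)
```

Comparing task performance with individual encoders dropped is one way to check whether the encoders contribute complementary information rather than a plain ensemble gain.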

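This study runs detailed experiments with three popular vision encoders on six downstream vision and language tasks and analyzes the attention and VE-dropout patterns. The results show that different vision encoders complement each other and improve downstream V+L task performance, and that the gains are not simple ensemble effects. Future vision encoders have the potential to improve performance on the target V+L tasks.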

One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

Object detection plays an important role in current solutions to vision and
language tasks like image captioning and visual question answering. However,
popular models like Faster R-CNN rely on a costly process of annotating
ground-truths for both the bounding boxes and their corresponding semantic
labels, making it less amenable as a primitive task for transfer learning. In
this paper, we examine the effect of decoupling box proposal and featurization
for downstream tasks. The key insight is that this decoupling allows us to
leverage a large number of labeled annotations that were previously unavailable for
standard object detection benchmarks. Empirically, we demonstrate that this
leads to effective transfer learning and improved image captioning and visual
question answering models, as measured on publicly available benchmarks.
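
Decoupling box proposal from featurization amounts to treating the two stages as independent, swappable components. The sketch below shows only that interface, under stated assumptions: `region_features`, `crop`, `toy_proposer`, and `mean_intensity` are hypothetical names, and the toy proposer and featurizer stand in for models that would be trained separately on different data.

```python
def crop(image, box):
    """Cut a region out of a 2D image given (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def region_features(image, propose_boxes, featurize):
    """Run the two stages independently.

    propose_boxes: callable returning class-agnostic boxes for an image.
    featurize: callable mapping an image crop to a feature value or vector.
    Neither stage needs the other's labels or training data.
    """
    boxes = propose_boxes(image)
    return [(box, featurize(crop(image, box))) for box in boxes]


def toy_proposer(image):
    # Stand-in proposer: the top-left pixel and the full image.
    height, width = len(image), len(image[0])
    return [(0, 0, 1, 1), (0, 0, width, height)]


def mean_intensity(region):
    # Stand-in featurizer: average pixel value of the crop.
    values = [v for row in region for v in row]
    return sum(values) / len(values)


image = [[1, 2], [3, 4]]
features = region_features(image, toy_proposer, mean_intensity)
```

Because the featurizer only sees crops, it can be replaced by a model trained on a different labeled source without retraining the proposer.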

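This paper examines the important role of object detection in vision and language tasks, such as image captioning and visual question answering, and the effect of decoupling box proposal and featurization on downstream tasks. Empirically, this decoupling leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.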