Visual dialog is a challenging vision-language task in which a dialog agent must answer a series of questions by reasoning over the image content and the dialog history. Prior work has mostly focused on various attention mechanisms to model these intricate interactions. By contrast, in this work we propose VD-BERT, a simple yet effective unified vision-dialog Transformer framework that leverages pretrained BERT language models for visual dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores, respectively) on the visual dialog benchmark leaderboard. We release the code and pretrained models to replicate the results from this paper at https://github.com/yuewang-cuhk/VD-BERT.

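The VD-BERT framework proposed in this work is a simple yet effective vision-dialog Transformer encoder. It captures the interactions between the image and the multi-turn dialog with a single unified encoder, supports both answer ranking and answer generation by building on the BERT language model, and reaches state-of-the-art results without pretraining on external vision-language data.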
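To make the single-stream idea more concrete, the sketch below shows one way an image and a multi-turn dialog could be packed into a single input sequence, and how one encoder could serve both ranking and generation by switching only the self-attention mask. This is an illustration under stated assumptions, not the released implementation: the function names `build_single_stream` and `attention_mask`, the `[EOT]` turn separator, the segment ids, the placeholder region tokens (real systems would use region feature vectors), and the choice of a fully bidirectional mask for ranking versus a left-to-right mask over the answer for generation are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def build_single_stream(
    image_regions: List[str],
    history: List[str],
    question: str,
    answer: str,
) -> Tuple[List[str], List[int], int]:
    """Concatenate image and dialog tokens into one sequence.

    Returns the tokens, segment ids (0 = image, 1 = text), and the length
    of the context (everything before the answer).
    """
    # Image part: placeholder strings stand in for region features.
    tokens = ["[CLS]"] + image_regions + ["[SEP]"]
    segments = [0] * len(tokens)

    # Dialog part: past turns separated by a hypothetical [EOT] token,
    # followed by the current question.
    text: List[str] = []
    for turn in history:
        text += turn.split() + ["[EOT]"]
    text += question.split() + ["[SEP]"]
    context_len = len(tokens) + len(text)

    # Candidate (ranking) or target (generation) answer.
    answer_tokens = answer.split() + ["[SEP]"]
    tokens += text + answer_tokens
    segments += [1] * (len(text) + len(answer_tokens))
    return tokens, segments, context_len


def attention_mask(n: int, context_len: int, mode: str) -> List[List[int]]:
    """mask[i][j] == 1 means position i may attend to position j."""
    if mode == "ranking":
        # Assumed: every position sees every other position.
        return [[1] * n for _ in range(n)]
    if mode == "generation":
        # Assumed: context is fully visible; answer tokens see the context
        # and only earlier (or same) answer positions.
        return [
            [1 if (j < context_len or j <= i) else 0 for j in range(n)]
            for i in range(n)
        ]
    raise ValueError(f"unknown mode: {mode}")


tokens, segments, context_len = build_single_stream(
    image_regions=["<img_0>", "<img_1>"],
    history=["is it sunny ? yes"],
    question="how many people ?",
    answer="two",
)
rank_mask = attention_mask(len(tokens), context_len, "ranking")
gen_mask = attention_mask(len(tokens), context_len, "generation")
```

In this sketch the same token sequence feeds both settings, so the only difference between scoring a candidate answer and decoding one is which mask is passed to the encoder.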

VD-BERT: A Unified Vision and Dialog Transformer with BERT