The Guesser plays an important role in GuessWhat?! like visual dialogues. It locates the target object in an image supposed by an oracle oneself over a question-answer based dialogue between a Questioner and the Oracle. Most existing guessers make one and only one guess after receiving all question-answer pairs in a dialogue with predefined number of rounds. This paper proposes the guessing state for the guesser, and regards guess as a process with change of guessing state through a dialogue. A guessing state tracking based guess model is therefore proposed. The guessing state is defined as a distribution on candidate objects in the image. A state update algorithm including three modules is given. UoVR updates the representation of the image according to current guessing state, QAEncoder encodes the question-answer pairs, and UoGS updates the guessing state by combining both information from the image and dialogue history. With the guessing state in hand, two loss functions are defined as supervisions for model training. Early supervision brings supervision to guesser at early rounds, and incremental supervision brings monotonicity to the guessing state. Experimental results on GuessWhat?! dataset show that our model significantly outperforms previous models, achieves new state-of-the-art, especially, the success rate of guessing 83.3% is approaching human-level performance 84.4%.

本文提出了一种猜测状态跟踪的猜测模型，用于GuessWhat？！任务中的视觉定位和对话，以改善现有的猜测器，如Guesser的精度，实验结果显示，该模型在现有模型中表现最佳，猜测成功率达到83.3％，接近人类的84.4％。

视觉对话的猜测状态跟踪