We present an optimised multi-modal dialogue agent for interactive learning of visually grounded word meanings from a human tutor, trained on real human-human tutoring data. Over a life-long interactive learning period, the agent, trained using Reinforcement Learning (RL), must handle natural conversations with human users and achieve good learning performance (accuracy) while minimising human effort in the learning process. We train and evaluate the system in interaction with a simulated human tutor built on the BURCHAK corpus, a human-human dialogue dataset for the visual learning task. The results show that: 1) the learned policy can coherently interact with the simulated user to achieve the goal of the task (i.e. learning visual attributes of objects, e.g. colour and shape); and 2) it finds a better trade-off between classifier accuracy and tutoring costs than hand-crafted rule-based policies, including ones with dynamic policies.

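Using a reinforcement learning model, this work trains a multi-modal dialogue agent for human-machine interaction in which learning is grounded in images. The agent is trained and evaluated interactively on the BURCHAK corpus, with the aim of improving classifier accuracy while minimising human effort during learning. The results show that the agent's learned policy outperforms hand-crafted policies and that the agent can learn effectively in cooperation with the simulated human tutor.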

Learning how to learn: an adaptive dialogue agent for incrementally learning visually grounded word meanings