Deep reinforcement learning is a promising approach to training a dialog
manager, but current methods struggle with the large state and action spaces of
multi-domain dialog systems. Building upon Deep Q-learning from Demonstrations
(DQfD), an algorithm that scores highly in difficult Atari games, we leverage
dialog data to guide the agent to successfully respond to a user's requests. We
make progressively fewer assumptions about the data needed, using labeled,
reduced-labeled, and even unlabeled data to train expert demonstrators. We
introduce Reinforced Fine-tune Learning, an extension to DQfD, enabling us to
overcome the domain gap between the datasets and the environment. Experiments
in a challenging multi-domain dialog system framework validate our approaches,
which achieve high success rates even when trained on out-of-domain data.

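This work proposes Reinforced Fine-tune Learning, a method built on Deep Q-learning from Demonstrations that trains expert demonstrators on labeled, reduced-labeled, and unlabeled data to address the large state and action spaces of multi-domain dialog systems, and it achieves high success rates in experiments.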

Learning Dialog Policies from Weak Demonstrations

Deep reinforcement learning (RL) has achieved several high-profile successes
in difficult decision-making problems. However, these algorithms typically
require a huge amount of data before they reach reasonable performance. In
fact, their performance during learning can be extremely poor. This may be
acceptable for a simulator, but it severely limits the applicability of deep RL
to many real-world tasks, where the agent must learn in the real environment.
In this paper we study a setting where the agent may access data from previous
control of the system. We present an algorithm, Deep Q-learning from
Demonstrations (DQfD), that massively accelerates the learning process even
from relatively small amounts of demonstration data. Thanks to a prioritized
replay mechanism, it automatically assesses the necessary ratio of
demonstration data while learning.
DQfD works by combining temporal difference updates with supervised
classification of the demonstrator's actions. We show that DQfD has better
initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN):
it starts with better scores on the first million steps on 41 of 42 games,
and on average it takes PDD DQN 83 million steps to catch up to DQfD's
performance. DQfD learns to outperform the best demonstration given in 14 of
42 games. In addition, DQfD leverages human demonstrations to achieve
state-of-the-art results for 11 games. Finally, we show that DQfD performs
better than three related algorithms for incorporating demonstration data into
DQN.
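
The combination of temporal difference updates with supervised classification
of the demonstrator's actions can be illustrated with a minimal sketch. The
abstract does not state the exact form of the losses, so the following are
assumptions for illustration only: a squared one-step TD error, a large-margin
classification term applied only to demonstration transitions, and the
hypothetical names and default values `gamma`, `margin`, and `lambda_sup`.
Target networks, multi-step returns, and regularization are omitted.

```python
from typing import Sequence


def one_step_td_error(q_sa: float, reward: float, next_q: Sequence[float],
                      gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    bootstrap = 0.0 if done else gamma * max(next_q)
    return reward + bootstrap - q_sa


def large_margin_loss(q_values: Sequence[float], expert_action: int,
                      margin: float = 0.8) -> float:
    """max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E).

    l(a_E, a) is 0 for the demonstrator's action and `margin` otherwise,
    so the loss is zero only when the demonstrator's action beats every
    other action by at least `margin`. (Illustrative assumption.)
    """
    augmented = [q + (0.0 if a == expert_action else margin)
                 for a, q in enumerate(q_values)]
    return max(augmented) - q_values[expert_action]


def combined_loss(q_values: Sequence[float], action: int, reward: float,
                  next_q: Sequence[float], is_demo: bool,
                  gamma: float = 0.99, lambda_sup: float = 1.0,
                  margin: float = 0.8, done: bool = False) -> float:
    """Squared TD loss, plus the supervised term on demonstration data."""
    td = one_step_td_error(q_values[action], reward, next_q, gamma, done)
    loss = 0.5 * td ** 2
    if is_demo:
        # Supervised classification applies only to demonstrator transitions.
        loss += lambda_sup * large_margin_loss(q_values, action, margin)
    return loss
```

In this sketch, a transition generated by the agent itself contributes only
the TD term, while a demonstration transition is additionally penalized
whenever the demonstrator's action does not have the highest Q-value by the
chosen margin.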

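This paper introduces the Deep Q-learning from Demonstrations (DQfD) algorithm and examines its suitability for tasks where the agent must learn in the real environment rather than in a simulator. DQfD uses a prioritized replay mechanism and combines temporal difference updates with supervised learning to massively accelerate learning from a small amount of demonstration data. Experiments show that DQfD performs better than three related algorithms and that it leverages human demonstrations to achieve state-of-the-art results on some games.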