Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes that generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods that address these challenges and lead to new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions. Second, we propose a stochastic sampling scheme to reduce the considerable gap between the expert actions used in training and the sampled actions used at test time, so that the agent can learn to correct its own mistakes during long sequential action decoding. Combining the two techniques, we achieve a new state of the art on the Room-to-Room benchmark, with a 6% absolute gain over the previous best result (47% -> 53%) on the Success Rate weighted by Path Length metric.
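
The stochastic sampling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact scheme, so the code assumes the simplest reading, in which each training step follows the expert with some probability and otherwise executes an action sampled from the agent's own predicted distribution. The names `choose_action`, `rollout`, `expert_fn`, `policy_fn`, and `expert_prob` are hypothetical and introduced only for illustration.

```python
import random


def choose_action(expert_action, policy_probs, expert_prob, rng):
    """Pick the action to execute at one training step.

    With probability `expert_prob` the expert action is followed;
    otherwise an action index is sampled from the agent's predicted
    distribution `policy_probs`. (Assumed mixing rule, not the paper's.)
    """
    if rng.random() < expert_prob:
        return expert_action
    actions = list(range(len(policy_probs)))
    return rng.choices(actions, weights=policy_probs, k=1)[0]


def rollout(expert_fn, policy_fn, expert_prob, num_steps, seed=0):
    """Roll out one training episode, mixing expert and sampled actions.

    `expert_fn(step, history)` returns the expert action for the state
    reached so far; `policy_fn(step, history)` returns the agent's action
    probabilities. Both are hypothetical stand-ins for a simulator and a
    learned model.
    """
    rng = random.Random(seed)
    history = []
    for step in range(num_steps):
        expert_action = expert_fn(step, history)
        probs = policy_fn(step, history)
        history.append(choose_action(expert_action, probs, expert_prob, rng))
    return history
```

Setting `expert_prob=1.0` in this sketch reduces to pure expert-driven training, while lower values expose the agent to states reached through its own, possibly mistaken, actions. In those states the expert action still supplies the supervision target, which is what lets the agent learn to recover.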

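This paper proposes two effective methods for improving instruction representation and action decoding in the vision-and-language navigation (VLN) challenge. The first uses large-scale pretrained language models to learn better text representations. The second is a stochastic sampling scheme that narrows the gap between the actions taken in training and those taken at test time, so that the agent can learn to correct itself during long sequences of action decoding. Combining the two techniques yields a new state of the art on the Room-to-Room benchmark, with a 6% absolute improvement (47% -> 53%) on the Success Rate weighted by Path Length metric.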

Robust Navigation with Language Pretraining and Stochastic Sampling