Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g., 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following questi