In this work, we study the problem of Embodied Referring Expression
Grounding, where an agent needs to navigate in a previously unseen environment
and localize a remote object described by a concise, high-level natural-language
instruction. When facing such a situation, a human tends to imagine what the
destination may look like and to explore the environment