Understanding spatial and visual information is essential for a navigation
agent that follows natural language instructions. Current Transformer-based
vision-and-language navigation (VLN) agents entangle orientation and vision
information, which limits the gain from learning each information source. In t