Humans can collaborate and complete tasks based on visual signals and
instruction from the environment. Training such a robot is difficult especially
due to the understanding of the instruction and the complicated environment.
Previous instruction-following agents are biased to English-centric corpus,
making it unrealizable to be applied to users that use mu