To successfully perform tasks specified by natural language
instructions, an artificial agent operating in a visual world needs to map
words, concepts, and actions from the instruction to visual elements in its
environment. This association is termed task-oriented grounding.