It remains challenging to build an AI system that can perform tasks
involving vision and language at a human level. So far, researchers have
treated individual tasks separately, designing a network for each and
training it on that task's dedicated dataset. Although thi