Visually-situated language is ubiquitous -- sources range from textbooks with
diagrams to web pages with images and tables, to mobile apps with buttons and
forms. Perhaps due to this diversity, previous work has typically relied on
domain-specific recipes with limited sharing of the un