Recent breakthroughs in natural language processing (NLP) have been driven by
language models trained on a massive amount of plain text. While powerful,
deriving supervision from textual resources is still an open question. For
example, language model pretraining often neglects the ric