Current pre-trained language models have enabled remarkable improvements in
downstream tasks, but it remains difficult to distinguish effects of
statistical correlation from more systematic logical reasoning grounded on the
understanding of real world. We tease these factors apart by l