computer vision has benefited from initializing multiple deep layers with
weights pretrained on large supervised training sets like ImageNet. Natural
language processing (NLP) typically sees initialization of only the lowest
layer of deep models with pretrained word vectors. In this pa