Recent work has improved language models (LMs) remarkably by equipping them
with a non-parametric memory component. However, most existing approaches only
introduce mem-ories at testing time or represent them using a separately
trained encoder, resulting in suboptimal training of the l