We explore the properties of byte-level recurrent language models. Given
sufficient capacity, training data, and compute time, the
representations learned by these models include disentangled features
corresponding to high-level concepts. Specifically, we find a single