Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning. Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.

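This work investigates whether modern machine learning models must memorize the information in their training examples in order to learn accurately. We study simple variants of two prediction problems and show that every sufficiently accurate training algorithm must encode, in its prediction model, essentially all the information about a large subset of its training examples. This holds even when the examples are high-dimensional, have entropy much higher than the sample size, and consist mostly of information irrelevant to the task, and it does not depend on the training algorithm or the class of models used for learning.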

When is memorization of irrelevant training data necessary for high-accuracy learning?