An information-theoretic framework is presented for estimating the number of labeled samples needed to train a classifier in a parametric Bayesian setting. Ideas from rate-distortion theory are used to derive bounds on the average $L_1$ or $L_\infty$ distance between the learned classifier and the true maximum {\em a posteriori} classifier (both distances are well-established surrogates for the excess classification error due to imperfect learning) in terms of the differential entropy of the posterior distribution, the Fisher information of the parametric family, and the number of training samples available. The maximum {\em a posteriori} classifier is viewed as a random source, labeled training data are viewed as a finite-rate encoding of the source, and the $L_1$ or $L_\infty$ Bayes risk is viewed as the average distortion. The result is a framework complementary to the well-known probably approximately correct (PAC) framework. PAC bounds characterize the worst-case learning performance of a family of classifiers whose complexity is captured by the Vapnik-Chervonenkis (VC) dimension. The rate-distortion framework, on the other hand, characterizes the average-case performance of a family of data distributions in terms of a quantity called the interpolation dimension, which represents the complexity of that family. The resulting bounds do not suffer from the pessimism typical of the PAC framework, particularly when the training set is small. The framework also naturally accommodates multi-class settings. Furthermore, Monte Carlo methods provide accurate estimates of the bounds even for complicated distributions. The effectiveness of the framework is demonstrated in both binary and multi-class Gaussian settings.

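This paper proposes an information-theoretic framework for estimating the number of labeled samples needed to train a classifier in a parametric Bayesian setting. Using the $L_p$ loss as the distortion measure, it derives lower and upper bounds on the average $L_p$ distance between the learned classifier and the true maximum {\em a posteriori} classifier in terms of the differential entropy of the posterior distribution and the interpolation dimension, a quantity that characterizes the complexity of the parametric family of distributions. It also gives lower bounds on the Bayes $L_p$ risk. The framework complements the probably approximately correct (PAC) framework: PAC provides minimax risk bounds involving the Vapnik-Chervonenkis dimension or the Rademacher complexity, whereas the proposed rate-distortion framework provides lower bounds on the risk averaged over the data distribution.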

Rate-Distortion Bounds on Bayes Risk in Supervised Learning