We investigate the high-dimensional properties of robust regression estimators in the presence of heavy-tailed contamination of both the covariates and response functions. In particular, we provide a sharp asymptotic characterisation of M-estimators trained on a family of elliptical covariate and noise data distributions including cases where second and higher moments do not exist. We show that, despite being consistent, the Huber loss with optimally tuned location parameter $\delta$ is suboptimal in the high-dimensional regime in the presence of heavy-tailed noise, highlighting the necessity of further regularisation to achieve optimal performance. This result also uncovers the existence of a curious transition in $\delta$ as a function of the sample complexity and contamination. Moreover, we derive the decay rates for the excess risk of ridge regression. We show that, while it is both optimal and universal for noise distributions with finite second moment, its decay rate can be considerably faster when the covariates' second moment does not exist. Finally, we show that our formulas readily generalise to a richer family of models and data distributions, such as generalised linear estimation with arbitrary convex regularisation trained on mixture models.
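
For reference, the Huber loss and the regularised M-estimator mentioned above can be written as follows. This is a sketch in the standard textbook convention: the symbols $\rho_\delta$, $\hat{w}$, $\lambda$, $x_i$, $y_i$, $n$, $d$ and the normalisation are assumptions introduced here for illustration and may differ from those used in the body of the paper.

$$
\rho_\delta(r) =
\begin{cases}
\tfrac{1}{2} r^2, & |r| \le \delta, \\
\delta \left( |r| - \tfrac{1}{2}\delta \right), & |r| > \delta,
\end{cases}
\qquad
\hat{w} = \arg\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \rho_\delta\!\left( y_i - x_i^\top w \right) + \frac{\lambda}{2} \lVert w \rVert_2^2 .
$$

In this convention, $\lambda = 0$ corresponds to the unregularised Huber estimator, while letting $\delta \to \infty$ reduces $\rho_\delta$ to the square loss, so that $\lambda > 0$ gives ridge regression.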


High-dimensional robust regression under heavy-tailed data: asymptotics and universality