We study the problem of high-dimensional linear regression in a robust model where an $\epsilon$-fraction of the samples can be adversarially corrupted. We focus on the fundamental setting where the covariates of the uncorrupted samples are drawn from a Gaussian distribution $\mathcal{N}(0, \Sigma)$ on $\mathbb{R}^d$. We give nearly tight upper bounds and computational lower bounds for this problem. Specifically, our main contributions are as follows: For the case that the covariance matrix is known to be the identity, we give a sample near-optimal and computationally efficient algorithm that outputs a candidate hypothesis vector $\widehat{\beta}$ which approximates the unknown regression vector $\beta$ within $\ell_2$-norm $O(\epsilon \log(1/\epsilon) \sigma)$, where $\sigma$ is the standard deviation of the random observation noise. An error of $\Omega (\epsilon \sigma)$ is information-theoretically necessary, even with infinite sample size. Prior work gave an algorithm for this problem with sample complexity $\tilde{\Omega}(d^2/\epsilon^2)$ whose error guarantee scales with the $\ell_2$-norm of $\beta$. For the case of unknown covariance, we show that we can efficiently achieve the same error guarantee as in the known covariance case using an additional $\tilde{O}(d^2/\epsilon^2)$ unlabeled examples. On the other hand, an error of $O(\epsilon \sigma)$ can be information-theoretically attained with $O(d/\epsilon^2)$ samples. We prove a Statistical Query (SQ) lower bound providing evidence that this quadratic tradeoff in the sample size is inherent. More specifically, we show that any polynomial time SQ learning algorithm for robust linear regression (in Huber's contamination model) with estimation complexity $O(d^{2-c})$, where $c>0$ is an arbitrarily small constant, must incur an error of $\Omega(\sqrt{\epsilon} \sigma)$.

研究了高维线性回归在对抗性污染下的稳健模型问题，并针对从高斯分布生成的未被修正的样本的基本情况给出了几乎最紧的上界和计算下界。

关于鲁棒线性回归的高效算法及下界