Face recognition models are commonly trained on web-scraped datasets containing millions of images and evaluated on test sets that emphasize pose, age, and mixed attributes. Because the train and test sets are both assembled from web-scraped images, it is critical to ensure that their sets of identities are disjoint. However, existing train and test sets have not considered this. Moreover, as accuracy levels become saturated (e.g., $>99.8\%$ on LFW), more challenging test sets are needed. We show that current train and test sets are generally not identity-disjoint, or even image-disjoint, and that this results in an optimistic bias in the estimated accuracy. In addition, we show that identity-disjoint folds are important in the 10-fold cross-validation estimate of test accuracy. To better support continued advances in face recognition, we introduce two "Goldilocks" test sets, Hadrian and Eclipse. The former emphasizes challenging facial hairstyles, and the latter emphasizes challenging over- and under-exposure conditions. Images in both datasets are drawn from a large, controlled-acquisition (not web-scraped) dataset, so they are identity- and image-disjoint from all popular training sets. Accuracy on these new test sets generally falls below that observed on LFW, CPLFW, CALFW, CFP-FP, and AgeDB-30, showing that these datasets represent important dimensions for improvement of face recognition. The datasets are available at: \url{https://github.com/HaiyuWu/SOTA-Face-Recognition-Train-and-Test}
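
As an illustration of the identity-disjoint folds mentioned above, the following is a minimal sketch rather than the authors' procedure. It assumes each image comes as a hypothetical `(image_name, identity_label)` pair and that identity labels are directly comparable. Labels from different web-scraped sources often are not, so detecting overlap across datasets would in practice need more than a label comparison. The function names `identity_disjoint_folds` and `shared_identities` are invented for this example.

```python
from collections import defaultdict


def identity_disjoint_folds(images, n_folds=10):
    """Split images into folds so that no identity appears in two folds.

    images: list of (image_name, identity_label) pairs (assumed format).
    """
    identities = sorted({ident for _, ident in images})
    # Assign whole identities (not individual images) to folds, round-robin.
    fold_of = {ident: i % n_folds for i, ident in enumerate(identities)}
    folds = defaultdict(list)
    for name, ident in images:
        folds[fold_of[ident]].append((name, ident))
    return [folds[i] for i in range(n_folds)]


def shared_identities(images_a, images_b):
    """Return the identity labels that occur in both image lists."""
    return {i for _, i in images_a} & {i for _, i in images_b}
```

Calling `shared_identities(train_images, test_images)` on two such lists returns an empty set only when the two sets are identity-disjoint under the label-matching assumption. The same check applied to every pair of folds confirms that the folds share no identities.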

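Face recognition models are typically trained on web-scraped datasets with millions of images and evaluated on test sets that emphasize pose, age, and mixed attributes. However, when both the train and test sets are assembled from web-scraped images, it is critical that their sets of identities do not overlap. This work finds that current train and test sets are generally not identity-disjoint or image-disjoint, which leads to an optimistic bias in estimated accuracy. To support continued progress in face recognition, this work introduces two challenging test sets: "Hadrian" emphasizes challenging facial hairstyles, and "Eclipse" emphasizes challenging over- and under-exposure conditions. These datasets are identity- and image-disjoint from standard training sets, and accuracy on the new test sets is generally lower than on LFW, CPLFW, CALFW, CFP-FP, and AgeDB-30, indicating that they represent important dimensions for improving face recognition.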

What is a suitable face verification test set?