The repeated community-wide reuse of test sets in popular benchmark problems
raises doubts about the credibility of reported test-error rates. Verifying
whether a learned model has overfitted to a test set is challenging because
independent test sets drawn from the same data distribution are usually
unavailable, while other test sets may introduce a distribution s