Datasets that are terabytes in size are increasingly common, but computational
bottlenecks often frustrate a complete analysis of the data. While more data
are better than fewer, diminishing returns suggest that we may not need
terabytes of data to estimate a parameter or test a hypothesis. But which rows
of data should we analyze, and might an arbitrary subset of