There is a fundamental gap between how humans understand and use language -- in open-ended, real-world situations -- and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied. We in