There is a fundamental gap between how humans understand and use language -- in open-ended, real-world situations -- and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

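The paper proposes a task called TuringAdvice, with a corresponding dataset, to test the language understanding of natural language generation (NLG) models. Empirical results show that current NLG models perform poorly on this task: in only 14% of cases do they produce advice at least as helpful as human-written advice. This reveals language understanding errors that are hard to spot outside of a generative setting, and leaves much room for progress.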

TuringAdvice: A Generative and Dynamic Evaluation of Language Use