Existing benchmarks used to evaluate the performance of end-to-end neural
dialog systems lack a key component: the natural variation present in human
conversations. Most datasets are constructed through crowdsourcing, where
crowd workers follow a fixed template of instructions while en