Standard accuracy metrics indicate that reading comprehension systems are
making rapid progress, but the extent to which these systems truly understand
language remains unclear. To reward systems with real language understanding,
SQuAD 2.0 combines the existing SQuAD data with over 50,000 unanswerable
questions. To do well on SQuAD 2.0, extractive reading comprehension systems
must not only answer questions when possible, but also determine when no
answer is supported by the passage and abstain from answering. This makes
SQuAD 2.0 a challenging natural language understanding task for existing
models: strong systems achieve only 66% F1 on it.
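One common way such systems implement abstention (a minimal sketch of a widely used thresholding heuristic, not the specific method evaluated in the paper): the model scores every candidate answer span plus a special "no answer" (null) position, and abstains when the null score beats the best span score by more than a tuned threshold. The function name, toy logits, and the convention that index 0 is the null position are illustrative assumptions.

```python
def best_answer(start_logits, end_logits, max_len=15, threshold=0.0):
    """Return (start, end) for the best answer span, or None to abstain.

    Assumes index 0 is the null ("no answer") position, as in typical
    SQuAD 2.0 setups. `threshold` is tuned on a development set.
    """
    null_score = start_logits[0] + end_logits[0]
    best_score, best_span = float("-inf"), None
    # Score all non-null spans of bounded length.
    for s in range(1, len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    # Abstain when "no answer" outscores the best span by the margin.
    if null_score - best_score > threshold:
        return None
    return best_span

# Toy logits where tokens 3-4 clearly form the answer: model answers.
print(best_answer([1.0, 0.1, 0.2, 3.0, 0.5], [1.0, 0.0, 0.1, 0.2, 3.5]))
# Toy logits where the null position dominates: model abstains.
print(best_answer([9.0, 0.1, 0.2, 0.3, 0.1], [9.0, 0.0, 0.1, 0.2, 0.1]))
```

Tuning `threshold` trades precision on answerable questions against recall of unanswerable ones, which is exactly the ability SQuAD 2.0 is designed to measure.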