Recent work has demonstrated substantial gains from pre-training large-scale unidirectional language models such as GPT-2, GPT-3, and GPT-Neo, followed by fine-tuning on a downstream task. In this paper, we evaluate the performance of the 1.3-billion-parameter GPT-Neo model on commonsense reasoning tasks. We assess the model on six commonsense reasoning benchmarks and report accuracy scores for each. When fine-tuned with a suitable set of hyperparameters, the model obtains competitive scores on three of these tasks but struggles when the dataset is significantly smaller. The low performance on a few of these tasks suggests that these datasets are inherently difficult and that the model fails to establish coherent patterns from their limited training samples. We also investigate and substantiate our results using visualization, and conduct numerous inference tests to better understand the model's performance. Finally, we conduct thorough robustness tests using various methods to gauge the model's performance under numerous settings. These findings suggest a promising path for exploring language models smaller than the 175-billion-parameter GPT-3 for tasks requiring natural language understanding.

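This paper evaluates the 1.3-billion-parameter GPT-Neo model on commonsense reasoning tasks and finds that the model is competitive on some tasks but performs poorly when the dataset is significantly smaller. The authors also use visualization and inference tests to substantiate the results, and conduct thorough robustness tests using a variety of methods.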

GPT-Neo for Commonsense Reasoning: A Theoretical and Practical Lens