Large Language Models (LLMs) have demonstrated considerable advances, and several claims have been made that they exceed human performance. However, real-world tasks often require domain knowledge. Low-resource learning methods such as Active Learning (AL) have been proposed to reduce the cost of domain expert annotation, which raises the question: can LLMs surpass compact models trained with expert annotations in domain-specific tasks? In this work, we conduct an empirical study on four datasets from three different domains, comparing SOTA LLMs with small models trained on expert annotations with AL. We find that small models can outperform GPT-3.5 with a few hundred labeled examples, and that they achieve performance similar to or higher than that of GPT-4 despite being hundreds of times smaller. Based on these findings, we posit that LLM predictions can be used as a warmup method in real-world applications, and that human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.

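Based on experiments on four datasets, this study finds that small models trained on expert annotations can outperform GPT-3.5 with relatively little labeled data, and that they match or exceed GPT-4 in performance even though they are hundreds of times smaller. We therefore argue that, in real-world applications, LLM predictions can serve as a warmup method, while data annotation by domain experts remains necessary for the task to succeed.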

Humans Still Win over LLMs: An Empirical Study of Active Learning on Domain-Specific Annotation Tasks