Recent developments in large language models (LLMs) have shown promise in
enhancing the capabilities of natural language processing (NLP). Despite these
successes, there remains a dearth of research dedicated to the NLP
problem-solving abilities of LLMs. To fill the gap in this area, we present a
unique benchmarking dataset, NLPBench, comprising 378 college-level NLP
questions spanning various NLP topics sourced from Yale University's prior
final exams. NLPBench includes questions with context, in which multiple
sub-questions share the same public information, and diverse question types,
including multiple choice, short answer, and math. Our evaluation, centered on
LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting
strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study
reveals that the effectiveness of the advanced prompting strategies can be
inconsistent, occasionally damaging LLM performance, especially in smaller
models like the LLAMA-2 (13b). Furthermore, our manual assessment illuminated
specific shortcomings in LLMs' scientific problem-solving skills, with
weaknesses in logical decomposition and reasoning notably affecting results.

通过独特的基准数据集 NLPBench，评估了大型语言模型在自然语言处理中的问题解决能力，并发现高级提示策略的有效性不稳定，对 LLMs 性能有时造成损害，尤其是较小的模型 LLAMA-2（13 亿参数）中表现更明显；同时发现大型语言模型在科学问题解决能力方面存在特定的不足，逻辑分解和推理的薄弱性明显影响结果。