Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and a lack of diversity in LLM generation. Finally, we acknowledge that judging novelty is difficult, even for experts, and propose an end-to-end study design that recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements translate into meaningful differences in research outcomes.

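This study addresses the lack of evaluations showing whether large language models (LLMs) can generate novel, expert-level research ideas. Using a controlled experimental design with more than 100 NLP researchers, we conduct the first head-to-head comparison of LLM and human ideas and find that LLM-generated ideas are judged more novel than those of human experts. The study also reveals open problems in building and evaluating research agents and points to the need for further research.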

Can Large Language Models Generate Novel Research Ideas? A Large-Scale Human Study with Over 100 NLP Researchers