Measuring the Gap Between Human and LLM Research Ideas

Ziyu Chen, Yilun Zhao, Arman Cohan

LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from those of human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.

Research Brief

LLM-generated research ideas follow a consistently narrower, systematically shifted distribution relative to human ideas, concentrating on synthesis methods and 'bridge-like' opportunities.

Large Language Models (LLMs) are increasingly used to brainstorm research ideas, yet current evaluations mostly judge individual ideas on attributes such as novelty, feasibility, or expert preference. This paper instead measures the gap between human and LLM ideation at the level of distributions. The researchers built a large-scale evaluation framework from high-quality human research papers: for each paper, they identified a small set of closely related prior works that likely inspired it, then prompted LLMs to generate a new idea from the titles and summaries of those same works. They profiled each idea with a new two-axis 'research taste' taxonomy (opportunity pattern and research paradigm) and used the resulting profiles to quantify the divergence. The key finding is a consistent distributional gap: LLM ideas cluster around bridge-like opportunities and synthesis methods, while the human reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.
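The abstract does not say which metric is used to quantify the divergence, so the following is only a minimal sketch of one plausible way to compare two label distributions along a single taxonomy axis. Jensen-Shannon divergence is an assumption made here for illustration, not the paper's stated method. The category names other than "bridge" and all of the counts are hypothetical toy values, not results from the paper.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Turn a list of category labels into a probability vector over `categories`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    def kl(a, b):
        # Terms with a-probability 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical opportunity-pattern labels; only "bridge" echoes the paper's wording.
CATEGORIES = ["bridge", "limitation", "new_setting", "reframing"]

# Made-up label lists for illustration only (NOT data from the paper).
human_labels = ["bridge"] * 3 + ["limitation"] * 3 + ["new_setting"] * 2 + ["reframing"] * 2
llm_labels = ["bridge"] * 7 + ["limitation"] * 2 + ["new_setting"] * 1

human_dist = distribution(human_labels, CATEGORIES)
llm_dist = distribution(llm_labels, CATEGORIES)
gap = js_divergence(human_dist, llm_dist)

print(f"human: {human_dist}")
print(f"llm:   {llm_dist}")
print(f"JS divergence: {gap:.3f}")
```

In this toy setup the LLM labels pile up on "bridge" while the human labels are spread more evenly, so the divergence is strictly positive. A symmetric, bounded measure is convenient here because neither distribution is privileged and some categories may have zero mass in one of them.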

Potential Applications

• Developing AI tools that specifically target and expand LLM capabilities in areas where human creativity excels, such as novel problem framing or identifying entirely new research paradigms.
• Designing advanced benchmarking systems for generative AI in scientific discovery, allowing for a more nuanced comparison against human 'research taste' beyond simple metrics.
• Guiding human researchers towards 'white spaces' or less explored avenues in research where current LLMs are least effective, thus optimizing human-AI collaboration.
• Informing the training objectives for future LLMs aimed at scientific ideation, to better cultivate a broader and more diverse 'research taste' beyond mere synthesis.

Paper Trustworthiness Index

25/100

High Skepticism / Self-Published

This document should be treated with critical skepticism. It contains unverified scientific claims or was self-published.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

0 / 25

The abstract does not provide any information regarding the authors or their institutional affiliations, making it impossible to assess their track record or academic prestige.

Technical Rigor & Methodology

25 / 30

The paper outlines a robust methodology, including a 'large-scale evaluation framework,' 'reverse-engineering' prior works, and a novel 'two-axis research-taste taxonomy' for profiling and quantifying ideas. This systematic approach suggests strong technical rigor for an evaluation study.

Reproducibility & Openness

0 / 25

The abstract provides no information about the availability of code, datasets, or specific URLs, which are critical for an independent researcher to reproduce the described evaluation framework and findings.

Community Vetting & Peer Review

0 / 20

The abstract does not mention if the paper has undergone peer review, been accepted by a conference or journal, or if it is currently a preprint, making it impossible to assess its community vetting status.
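The four pillar scores above are consistent with the 25/100 index if the index is a plain sum of the pillars. That aggregation rule is an assumption inferred from the numbers on this page, not a documented formula. A minimal sketch of the arithmetic:

```python
# Pillar scores as (earned, maximum), copied from the breakdown above.
pillars = {
    "Author & Institutional Track Record": (0, 25),
    "Technical Rigor & Methodology": (25, 30),
    "Reproducibility & Openness": (0, 25),
    "Community Vetting & Peer Review": (0, 20),
}

# Assumed aggregation: an unweighted sum of earned points over the summed maxima.
total = sum(earned for earned, _ in pillars.values())
maximum = sum(cap for _, cap in pillars.values())
summary = f"{total}/{maximum}"

print(summary)
```

Under this assumption the entire score comes from the methodology pillar. The other three pillars score zero because the relevant information is absent from the abstract, not because of negative evidence.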

Detailed Evidence Assessment

Verified Evidence & Citations

LLMs are increasingly used to brainstorm research ideas.

“LLMs are increasingly used to brainstorm research ideas”

Existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference.

“existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference”

The paper builds a large-scale evaluation framework for ideation from high-quality human research papers.

“we build a large-scale evaluation framework for ideation from high-quality human research papers.”

The paper introduces a two-axis research-taste taxonomy to profile ideas.

“We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm”

A consistent distributional gap is observed between human and LLM ideas.

“Across idea sets generated by different LLMs, we observe a consistent distributional gap”

LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods.

“LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods”

The human paper reference distribution spreads more broadly.

“the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions.”

Uncertainties & Omissions

• Omission: Author names and affiliations.

• Omission: Publication venue (conference/journal) or peer-review status.

• Omission: Links to code, datasets, or model weights for reproducibility.

• Omission: Details on the specific LLMs used for idea generation.

• Omission: Quantitative metrics or examples used to 'quantify the divergence'.

• Uncertainty: The precise operationalization and comprehensive validity of the 'two-axis research-taste taxonomy' across diverse research fields.

• Uncertainty: The generalizability of the observed gap to all research domains and an even broader range of LLM architectures and prompting strategies.

• Uncertainty: Whether the 'reverse-engineering' of prior works fully captures the complex and often multi-faceted inspirations behind human research ideas.