arXiv 2512.15567

Evaluating Large Language Models in Scientific Discovery

By Zhangde Song, Jieyu Lu, et al.

Published 2025-12-17

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define researc…
