arXiv 2512.15567

Evaluating Large Language Models in Scientific Discovery

By Zhangde Song, Jieyu Lu, et al.

Published 2025-12-17

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define researc…
