arXiv 2506.00309
Evaluation of LLMs for mathematical problem solving
By Ruonan Wang, Runxi Wang, et al.
Published 2025-05-30
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their potential to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) on three mathematics datasets of varying complexity (GSM8K, MATH500, and MIT OpenCourseWare). We take a five-dimensional approach based o…