arXiv 2506.00309

Evaluation of LLMs for mathematical problem solving

By Ruonan Wang, Runxi Wang, et al.

Published 2025-05-30

Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their potential to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) on three mathematics datasets of varying complexity (GSM8K, MATH500, and MIT Open Courseware datasets). We take a five-dimensional approach based o…
