arXiv 2404.05692
Evaluating Mathematical Reasoning Beyond Accuracy
By Shijie Xia, Xuefeng Li, et al.
Published 2024-04-08
The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new…