arXiv 2404.05692

Evaluating Mathematical Reasoning Beyond Accuracy

By Shijie Xia, Xuefeng Li, et al.

Published 2024-04-08

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new…
