arXiv 2404.05692

Evaluating Mathematical Reasoning Beyond Accuracy

By Shijie Xia, Xuefeng Li, et al.

Published 2024-04-08

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new…
