arXiv 2404.05692

Evaluating Mathematical Reasoning Beyond Accuracy

By Shijie Xia, Xuefeng Li, et al.

Published 2024-04-08

Wiki summary

Leaderboards for Large Language Models (LLMs) on mathematical tasks are continuously updated. However, most evaluations focus solely on the final answer and neglect the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new…
