arXiv 2507.15337

Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

By Narun Raman, Taylor Lundy, et al.

Published 2025-07-21

When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has…
