arXiv 2507.15337

Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

By Narun Raman, Taylor Lundy, et al.

Published 2025-07-21

When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has…
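As a rough illustration of why MCQA makes automatic grading straightforward, the sketch below scores free-text model responses against gold option letters by exact match. It is not taken from the paper: the function names `extract_choice` and `grade`, and the convention that a response states its choice as "Answer: <letter>", are assumptions made only for this example.

```python
import re
from typing import Optional, Sequence

# Assumed response convention (hypothetical, not from the paper):
# the model states its final choice as "Answer: <letter>", e.g. "Answer: B"
# or "Answer: (B)". The lookahead rejects words such as "Answer: Both".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def grade(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Return accuracy: the fraction of responses whose letter matches gold."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    # Responses with no parseable letter yield None and count as incorrect.
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    responses = [
        "The capital is Paris. Answer: B",
        "Answer: (C)",
        "Not sure.",
    ]
    gold = ["B", "A", "D"]
    print(f"accuracy = {grade(responses, gold):.3f}")
```

Grading reduces to comparing one extracted letter per question, which is what makes the format cheap to score. Nothing in this check looks at how the model arrived at the letter.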
