arXiv 2511.21140
How to Correctly Report LLM-as-a-Judge Evaluations
By Chungpa Lee, Thomas Zeng, et al.
Published 2025-11-26
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have esti…
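To make the bias concrete, here is a minimal sketch of the classical misclassification correction (often called the Rogan-Gladen estimator), which assumes the judge's sensitivity and specificity are known exactly. This is the kind of correction the abstract describes as existing but underutilized; it is not necessarily the estimator the paper proposes. The function names `expected_judge_rate` and `corrected_accuracy` are hypothetical and chosen for illustration.

```python
def expected_judge_rate(true_accuracy: float, sensitivity: float, specificity: float) -> float:
    """Fraction of answers a noisy judge is expected to mark correct.

    sensitivity: P(judge says correct | answer is correct)
    specificity: P(judge says incorrect | answer is incorrect)
    """
    return true_accuracy * sensitivity + (1.0 - true_accuracy) * (1.0 - specificity)


def corrected_accuracy(judge_rate: float, sensitivity: float, specificity: float) -> float:
    """Invert the relation above to recover accuracy from the raw judge rate.

    Assumes sensitivity and specificity are known exactly (an idealization).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        # The judge is no better than chance, so the correction is undefined.
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (judge_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since sampling noise can push the raw estimate outside.
    return min(1.0, max(0.0, estimate))


# Illustrative numbers only (not taken from the paper).
raw = expected_judge_rate(true_accuracy=0.7, sensitivity=0.9, specificity=0.8)
fixed = corrected_accuracy(raw, sensitivity=0.9, specificity=0.8)
```

With these made-up values the raw judge rate comes out near 0.69 even though the true accuracy is 0.7, and the correction recovers 0.7. In practice sensitivity and specificity are themselves estimated from a finite labeled sample, so their uncertainty should carry into any reported interval.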