arXiv 2511.21140

How to Correctly Report LLM-as-a-Judge Evaluations

By Chungpa Lee, Thomas Zeng, et al.

Published 2025-11-26

Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have esti…
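To make the bias concrete, here is a minimal sketch of the classical plug-in correction (often called the Rogan-Gladen estimator) for a binary judge. It is an illustration under stated assumptions, not necessarily the estimator the paper proposes. It assumes sensitivity means the probability that the judge marks a truly correct answer as correct, and specificity means the probability that it marks a truly incorrect answer as incorrect. The function name and the numbers in the usage note are hypothetical.

```python
def corrected_accuracy(p_hat: float, sensitivity: float, specificity: float) -> float:
    """Plug-in bias correction for a judge-reported accuracy.

    Assumed model (not taken from the paper's text):
        P(judge says correct) = theta * sensitivity + (1 - theta) * (1 - specificity)
    Solving for the true accuracy theta gives
        theta = (p_hat + specificity - 1) / (sensitivity + specificity - 1).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        # The judge is no better than chance, so the correction is undefined.
        raise ValueError("sensitivity + specificity must exceed 1")
    theta = (p_hat + specificity - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the estimate outside it.
    return min(1.0, max(0.0, theta))
```

For example, with a hypothetical true accuracy of 0.8, sensitivity 0.9, and specificity 0.7, the judge would report 0.8 * 0.9 + 0.2 * 0.3 = 0.78 in expectation, and `corrected_accuracy(0.78, 0.9, 0.7)` recovers a value of approximately 0.8. The sketch treats sensitivity and specificity as known exactly; as the abstract notes, in practice they are usually only estimated, which adds uncertainty this function does not account for.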
