arXiv 2511.21140
How to Correctly Report LLM-as-a-Judge Evaluations
By Chungpa Lee, Thomas Zeng, et al.
Published 2025-11-26
Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have esti…
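To make the bias described above concrete, here is a minimal sketch of the classical correction for a binary judge with known error rates (often called the Rogan-Gladen estimator). It is an illustration under stated assumptions, not necessarily the estimator the paper proposes: the function name `corrected_accuracy` and its parameters are hypothetical, and sensitivity and specificity are treated as exactly known, which is precisely the assumption the abstract flags as unrealistic.

```python
def corrected_accuracy(p_hat: float, sensitivity: float, specificity: float) -> float:
    """Bias-corrected accuracy from a noisy binary judge (illustrative sketch).

    p_hat:       fraction of answers the LLM judge marked "correct"
    sensitivity: P(judge says correct | answer is truly correct)
    specificity: P(judge says incorrect | answer is truly incorrect)

    The judge's raw rate satisfies
        p = theta * sensitivity + (1 - theta) * (1 - specificity),
    where theta is the true accuracy. Solving for theta gives the estimate below.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        # A judge no better than chance carries no usable signal.
        raise ValueError("sensitivity + specificity must exceed 1")
    theta = (p_hat + specificity - 1.0) / denom
    # Clip to [0, 1], since sampling noise can push the estimate outside.
    return min(1.0, max(0.0, theta))


# Assumed example: true accuracy 0.5, judge sensitivity 0.9, specificity 0.8.
# The judge would then report 0.5 * 0.9 + 0.5 * 0.2 = 0.55 on average.
raw_rate = 0.55
estimate = corrected_accuracy(raw_rate, sensitivity=0.9, specificity=0.8)
```

In this assumed example the raw judge rate (0.55) overstates the true accuracy (0.5), and the correction recovers it. When sensitivity and specificity are themselves estimated from a labeled calibration set, the uncertainty in those estimates also has to be carried into any confidence interval, which this sketch does not attempt.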