arXiv 2511.21140

How to Correctly Report LLM-as-a-Judge Evaluations

By Chungpa Lee, Thomas Zeng, et al.

Published 2025-11-26

Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model's specificity and sensitivity. Furthermore, in general we only have esti…
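To illustrate why imperfect sensitivity and specificity bias the reported accuracy, here is a minimal sketch of the classical misclassification correction (often called the Rogan-Gladen estimator). It is an assumption that this matches the kind of bias correction the abstract refers to; the excerpt above does not show the paper's own estimator. The sketch also treats sensitivity and specificity as known exactly. The function names `judge_pass_rate` and `corrected_accuracy` are hypothetical.

```python
def judge_pass_rate(true_accuracy: float, sensitivity: float, specificity: float) -> float:
    """Expected fraction of answers the judge marks correct.

    sensitivity: P(judge says correct | answer is correct)
    specificity: P(judge says incorrect | answer is incorrect)
    """
    false_positive_rate = 1.0 - specificity
    return true_accuracy * sensitivity + (1.0 - true_accuracy) * false_positive_rate


def corrected_accuracy(pass_rate: float, sensitivity: float, specificity: float) -> float:
    """Invert the relation above to recover accuracy from the judge's pass rate.

    Assumes sensitivity and specificity are known exactly (an idealization).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("judge must be better than chance: sensitivity + specificity > 1")
    estimate = (pass_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since sampling noise can push the raw estimate outside the valid range.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    raw = judge_pass_rate(true_accuracy=0.70, sensitivity=0.90, specificity=0.80)
    fixed = corrected_accuracy(raw, sensitivity=0.90, specificity=0.80)
    print(f"judge pass rate: {raw:.3f}, corrected accuracy: {fixed:.3f}")
```

In this toy setting the raw judge pass rate differs from the true accuracy, and the correction recovers it only because the judge's error rates are plugged in exactly.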
