arXiv 2601.18777

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

By Abhishek Divekar and Anirban Majumder

Published 2026-01-26

Evaluating the quality of search, ranking, and RAG systems traditionally requires a large number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) tha…
