arXiv 2601.18777

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

By Abhishek Divekar and Anirban Majumder

Published 2026-01-26

Evaluating the quality of search, ranking, and RAG systems traditionally requires a large number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) tha…
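
The abstract is cut off before the framework itself is described, so the following is only a minimal sketch of the classical PPI mean estimator that the paper says it extends, not the PRECISE method. It assumes a small set of items with both human and LLM-judge relevance labels and a larger set with judge labels only; the function name `ppi_mean` and all argument names are hypothetical.

```python
from statistics import mean


def ppi_mean(judge_labeled, human_labeled, judge_unlabeled):
    """Classical PPI point estimate of a mean metric (illustrative sketch).

    judge_labeled   -- LLM-judge scores on items that also have human labels
    human_labeled   -- human scores on those same items, in the same order
    judge_unlabeled -- LLM-judge scores on items without human labels
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must be paired")
    if not human_labeled or not judge_unlabeled:
        raise ValueError("both the labeled and unlabeled sets must be non-empty")

    # Rectifier: average (human - judge) gap measured on the labeled items.
    rectifier = mean([h - j for h, j in zip(human_labeled, judge_labeled)])

    # Judge-only estimate on the large set, corrected by the measured gap.
    return mean(judge_unlabeled) + rectifier


# Toy binary relevance labels, invented purely for illustration.
estimate = ppi_mean(
    judge_labeled=[1, 1, 0, 1],
    human_labeled=[1, 0, 0, 1],
    judge_unlabeled=[1, 1, 1, 0, 1, 0, 1, 1],
)
print(estimate)
```

In this toy input the judge rates one labeled item relevant that the human did not, so the rectifier pulls the judge-only average downward. How the paper adapts this idea to ranking metrics is not recoverable from the truncated abstract.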
