arXiv 2306.05685

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

By Lianmin Zheng, Wei-Lin Chiang, et al.

Published 2023-06-09

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as…
