arXiv 2306.05685

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

By Lianmin Zheng, Wei-Lin Chiang, et al.

Published 2023-06-09

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as…
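Position bias, one of the limitations named in the abstract, can be probed by asking a judge the same pairwise question twice with the answer order swapped. The following is a minimal sketch under stated assumptions, not the paper's implementation: the judge signature, the verdict strings, and the two stub judges are all hypothetical, and a real setup would replace the stubs with calls to an actual LLM.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for illustration):
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Query the judge in both answer orders.

    Returns "A", "B", or "tie" when the two verdicts agree,
    and "inconsistent" when the verdict flips with the order.
    """
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    # Translate positional verdicts back to the fixed labels A and B.
    forward_label = {"first": "A", "second": "B", "tie": "tie"}[forward]
    backward_label = {"first": "B", "second": "A", "tie": "tie"}[backward]

    return forward_label if forward_label == backward_label else "inconsistent"


# Stub judges standing in for an LLM call (hypothetical, for demonstration only).
def always_first(question: str, first: str, second: str) -> str:
    """A maximally position-biased judge: always prefers the first answer."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """An order-independent judge that simply prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(judge_both_orders(always_first, "q", "x", "y"))
    print(judge_both_orders(longer_wins, "q", "long answer", "short"))
```

The second stub passes the order-swap check while still preferring length over quality, which illustrates that a consistent verdict rules out position bias only, not verbosity bias.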