arXiv 2306.05685
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
By Lianmin Zheng, Wei-Lin Chiang, et al.
Published 2023-06-09
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as…