arXiv 2406.08598
Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks
By Justin Zhao, Flor Miriam Plaza-del-Arco, et al.
Published 2024-06-12
As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model such as GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks (such as those related to emotional intelligence, creative writing, and persuasiveness) may be too subjective…