arXiv 2406.08598

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

By Justin Zhao, Flor Miriam Plaza-del-Arco, et al.

Published 2024-06-12

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, a single LLM judge is prone to intra-model bias, and many tasks, such as those related to emotional intelligence, creative writing, and persuasiveness, may be too subjective…
