arXiv 2502.14445
PredictaBoard: Benchmarking LLM Score Predictability
By Lorenzo Pacchiardi, Konstantinos Voudouris, et al.
Published 2025-02-20
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success even in basic common-sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative b…