arXiv 2502.14445
PredictaBoard: Benchmarking LLM Score Predictability
By Lorenzo Pacchiardi, Konstantinos Voudouris, et al.
Published 2025-02-20
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative b…