arXiv 2502.14445
PredictaBoard: Benchmarking LLM Score Predictability
By Lorenzo Pacchiardi, Konstantinos Voudouris, et al.
Published 2025-02-20
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common-sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative b…