arXiv 2502.14445

PredictaBoard: Benchmarking LLM Score Predictability

By Lorenzo Pacchiardi, Konstantinos Voudouris, et al.

Published 2025-02-20

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative b…
