arXiv 2502.14445

PredictaBoard: Benchmarking LLM Score Predictability

By Lorenzo Pacchiardi, Konstantinos Voudouris, et al.

Published 2025-02-20

Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common-sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative b…
