arXiv 2503.01996

One ruler to measure them all: Benchmarking multilingual long-context language models

By Yekyung Kim, Jenna Russell, et al.

Published 2025-03-03


We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. ONERULER adapts the English-only RULER benchmark (Hsieh et al., 2024) with seven synthetic tasks that test both retrieval and aggregation, among them new variations of the "needle-in-a-haystack" task that allow for the possibility of a nonexistent needle. We create ONERULER through a two-step p…
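To make the "nonexistent needle" idea concrete, here is a minimal sketch of how a needle-in-a-haystack prompt might be assembled with or without the needle present. It is illustrative only: the function names, the filler text, the question wording, and the "none" answer convention are all assumptions made for this example, not the benchmark's actual templates or code.

```python
import random


def build_niah_prompt(filler_sentences, needle, insert_needle, seed=0):
    """Assemble a haystack and a question (hypothetical helper).

    When insert_needle is False the needle is left out, so the only
    correct response is to say that no answer exists.
    """
    rng = random.Random(seed)
    haystack = list(filler_sentences)
    if insert_needle:
        # randint is inclusive on both ends, so the needle may land
        # anywhere from the very start to the very end of the haystack.
        position = rng.randint(0, len(haystack))
        haystack.insert(position, needle)
    context = " ".join(haystack)
    # Assumed question wording; it explicitly permits a "none" answer.
    question = (
        "What is the special magic number mentioned in the text? "
        "If no such number is given, answer 'none'."
    )
    return f"{context}\n\n{question}"


def expected_answer(value, insert_needle):
    """Gold answer: the needle value if present, otherwise 'none'."""
    return value if insert_needle else "none"


filler = [f"Filler sentence {i}." for i in range(50)]
needle = "The special magic number is 7421."

with_needle = build_niah_prompt(filler, needle, insert_needle=True)
without_needle = build_niah_prompt(filler, needle, insert_needle=False)
```

Scoring a model on both variants separates genuine retrieval from a tendency to always produce some answer: a model that invents a number for `without_needle` fails even if it retrieves correctly from `with_needle`.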
