arXiv 2511.02817

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

By Amanda Bertsch, Adithya Pratapa, et al.

Published 2025-11-04

As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that…
