arXiv 2506.09987
A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs
By Benno Krojer, Mojtaba Komeili, et al.
Published 2025-06-11
Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation, because they admit shortcut solutions based on superficial visual or textual cues. To enable more accurate assessment of model performance, this paper introduces the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for…