arXiv 2506.09987
A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs
By Benno Krojer, Mojtaba Komeili, et al.
Published 2025-06-11
Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation, because models can exploit shortcut solutions based on superficial visual or textual cues. To make the assessment of model performance more accurate, this paper introduces the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for…