arXiv 2506.09987

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

By Benno Krojer, Mojtaba Komeili, et al.

Published 2025-06-11

Wiki summary


Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation, because models can exploit shortcut solutions based on superficial visual or textual cues. To assess model performance more accurately, this paper introduces the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for…
