arXiv 2407.13766
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
By Tsung-Han Wu, Giscard Biamby, et al.
Published 2024-07-18
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advances such as long-context LMMs allow these models to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like ph…