arXiv 2407.13766

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

By Tsung-Han Wu, Giscard Biamby, et al.

Published 2024-07-18

Large Multimodal Models (LMMs) have made significant strides in visual question answering for single images. Recent advances such as long-context LMMs allow models to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like ph…
