arXiv 2203.15867
Image Retrieval from Contextual Descriptions
By Benno Krojer, Vaibhav Adlakha, et al.
Published 2022-03-29
Discussion
Read the public discussion and references gathered around this paper.
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 m…