arXiv 2203.15867

Image Retrieval from Contextual Descriptions

By Benno Krojer, Vaibhav Adlakha, et al.

Published 2022-03-29

Discussion

Read the public discussion and references gathered around this paper.

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 m…

View the original paper on arXiv