arXiv 2305.16397

Are Diffusion Models Vision-And-Language Reasoners?

By Benno Krojer, Elinor Poole-Dayan, et al.

Published 2023-05-25

Text-conditioned image generation models have recently shown immense qualitative success using denoising diffusion processes. However, unlike discriminative vision-and-language models, it is a non-trivial task to subject these diffusion-based generative models to automatic fine-grained quantitative evaluation of high-level phenomena such as compositionality. Toward this goal, we introduce two innovations. First, we t…
