arXiv 2509.23625
RIV: Recursive Introspection Mask Diffusion Vision Language Model
By YuQian Li, Limeng Qiao, et al.
Published 2025-09-28
Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models cannot correct errors in tokens they have already generated; that is, they lack a self-correction capability. In this paper, we propose the Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The…