arXiv 2509.23625
RIV: Recursive Introspection Mask Diffusion Vision Language Model
By YuQian Li, Limeng Qiao, et al.
Published 2025-09-28
Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models cannot correct errors in tokens they have already generated; that is, they lack self-correction capability. In this paper, we propose the Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The…