arXiv 2509.23625

RIV: Recursive Introspection Mask Diffusion Vision Language Model

By YuQian Li, Limeng Qiao, et al.

Published 2025-09-28

Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models cannot correct errors in tokens they have already generated; that is, they lack self-correction capability. In this paper, we propose the Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The…
