arXiv 2511.04570
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
By Jingqi Tong, Yurong Mou, et al.
Published 2025-11-06
The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal un…