arXiv 2511.04570
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
By Jingqi Tong, Yurong Mou, et al.
Published 2025-11-06
"Thinking with Text" and "Thinking with Images" paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal un…