arXiv 2511.04570

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

By Jingqi Tong, Yurong Mou, et al.

Published 2025-11-06

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal un…
