arXiv 2510.08377

UniVideo: Unified Understanding, Generation, and Editing for Videos

By Cong Wei, Quande Liu, et al.

Published 2025-10-09

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video gene…
