arXiv 2510.08377

UniVideo: Unified Understanding, Generation, and Editing for Videos

By Cong Wei, Quande Liu, et al.

Published 2025-10-09

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video gene…
