arXiv 2509.16197
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
By Yanghao Li, Rui Qian, et al.
Published 2025-09-19
Mindmap
Browse the paper's core ideas, clusters, and relationships in a structured outline.
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shar…