arXiv 2509.16197

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

By Yanghao Li, Rui Qian, et al.

Published 2025-09-19

Mindmap

Browse the paper's core ideas, clusters, and relationships in a structured outline.

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shar…

View the original paper on arXiv