arXiv 2601.15621

Qwen3-TTS Technical Report

By Hangrui Hu, Xinfa Zhu, et al.

Published 2026-01-22

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS…
