arXiv 2601.03233

LTX-2: Efficient Joint Audio-Visual Foundation Model

By Yoav HaCohen, Benny Brazowski, et al.

Published 2026-01-06

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video s…
