arXiv 2601.03233
LTX-2: Efficient Joint Audio-Visual Foundation Model
By Yoav HaCohen, Benny Brazowski, et al.
Published 2026-01-06
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video s…