arXiv 2502.09992

Large Language Diffusion Models

By Shen Nie, Fengqi Zhu, et al.

Published 2025-02-14

The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a prin…
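The forward masking and reverse generation processes mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that each token is masked independently with probability t, and that generation starts from a fully masked sequence and fills in a share of the masked positions at each step. The names MASK, forward_mask, reverse_generate, and toy_predict are hypothetical, and toy_predict stands in for the Transformer mask predictor.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for the mask token


def forward_mask(tokens, t, rng):
    """Forward process (assumed): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def reverse_generate(length, predict, steps, rng):
    """Reverse process (assumed): start fully masked and unmask a share per step.

    `predict` stands in for the mask predictor: it takes the current
    sequence and returns one guessed token per position.
    """
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        guesses = predict(seq)
        remaining_steps = steps - step
        # Ceiling division, so every position is filled by the final step.
        k = -(-len(masked) // remaining_steps)
        for i in rng.sample(masked, k):
            seq[i] = guesses[i]
    return seq


# Toy usage: the "predictor" simply returns a fixed target sentence.
target = "diffusion models can generate text".split()


def toy_predict(seq):
    return target


rng = random.Random(0)
noisy = forward_mask(target, 0.5, rng)
generated = reverse_generate(len(target), toy_predict, steps=3, rng=rng)
```

In a real model the predictor would be trained to recover the original tokens at the masked positions of sequences like `noisy`. The random choice of which positions to unmask is one simple option picked for this sketch.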
