arXiv 2506.16659
A Minimalist Optimizer Design for LLM Pretraining
By Athanasios Glentis, Jiaxiang Li, et al.
Published 2025-06-20
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necess…
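To make the memory cost described in the abstract concrete, here is a minimal back-of-the-envelope sketch. It assumes only the standard fact that Adam stores one first-moment and one second-moment value per trainable parameter. The function name, the 7-billion-parameter count, and the 4-byte (fp32) state precision are illustrative assumptions, not figures from the paper.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Approximate bytes used by Adam's optimizer state.

    Adam keeps two buffers the same shape as the parameters:
    a first-moment estimate and a second-moment estimate.
    """
    moment_buffers = 2  # first moment + second moment
    return moment_buffers * num_params * bytes_per_value


def weight_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Bytes used by the model weights themselves, for comparison."""
    return num_params * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model size, chosen only for illustration.
    num_params = 7_000_000_000

    state = adam_state_bytes(num_params)
    weights = weight_bytes(num_params)

    print(f"weights:    {weights / 2**30:.1f} GiB")
    print(f"Adam state: {state / 2**30:.1f} GiB")
    print(f"state / weights ratio: {state / weights:.1f}x")
```

Under these assumptions the optimizer state alone is twice the size of the weights, which is the overhead that state-compressed optimizers aim to reduce.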