arXiv 2506.16659
A Minimalist Optimizer Design for LLM Pretraining
By Athanasios Glentis, Jiaxiang Li, et al.
Published 2025-06-20
Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necess…