arXiv 2506.16659

A Minimalist Optimizer Design for LLM Pretraining

By Athanasios Glentis, Jiaxiang Li, et al.

Published 2025-06-20

Training large language models (LLMs) typically relies on adaptive optimizers such as Adam, which require significant memory to maintain first- and second-moment matrices, known as optimizer states. While recent works such as GaLore, Fira, and APOLLO have proposed state-compressed variants to reduce memory consumption, a fundamental question remains: What is the minimal amount of optimizer state that is truly necess…
