arXiv 2601.14603

Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum

By Jingru Li, Yibo Fan, et al.

Published 2026-01-21

Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a varianc…
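The phrase "matrix analogue of the element-wise sign operator" can be illustrated with a small sketch. Orthogonalizing a matrix replaces every singular value with 1, just as `sign` replaces every nonzero scalar with magnitude 1. The code below is an illustration only, not the paper's method: it uses the classical cubic Newton-Schulz iteration, which is not necessarily the iteration or coefficients of any particular Muon implementation. It also assumes a nonzero, full-rank input. The helper names (`matmul`, `transpose`, `orthogonalize`, `sign`) are hypothetical and chosen for this example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of m (assumed nonzero, full rank).

    Classical cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X.
    Dividing by the Frobenius norm first puts all singular values in (0, 1],
    where the iteration drives each of them toward 1.
    """
    norm = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x


def sign(m):
    """Element-wise sign of a matrix: -1, 0, or 1 per entry."""
    return [[(v > 0) - (v < 0) for v in row] for row in m]


# On a diagonal matrix the two operators coincide.
momentum = [[2.0, 0.0], [0.0, -0.5]]
update = orthogonalize(momentum)
```

For a diagonal input such as `momentum`, `orthogonalize` agrees with `sign` up to numerical tolerance. For a general full-rank matrix the result is an orthogonal matrix that keeps the singular vectors and discards the singular-value magnitudes.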
