Muon, muP, and the Compute-Time Tradeoff
May 2, 2025
Recent posts by Jeremy Bernstein and Keller Jordan, along with a tweet from Moonshot, have built a lot of excitement around Muon, a lightweight second-order optimizer for training large language models. These posts raise two key questions:
Is Muon a robust replacement for AdamW?
Does Muon work well with muP, the maximal update parameterization?
We answer both of these questions in the affirmative. In this post, we share new experimental results showing that Muon not only works with muP, it delivers better compute-time tradeoffs than AdamW, especially at large batch sizes. When paired with a lightweight hyperparameter tuning strategy, it becomes a practical alternative at scale.
Why This Matters
The true cost of training isn't just steps or FLOPs. It's how much time and compute it takes to reach your target loss. If an optimizer can hit the same loss with fewer tokens, or faster wall time, it directly impacts deployment timelines and cloud costs.
Muon helps on both fronts.
What Muon Does Differently
Muon is a lightweight second-order optimizer. It estimates curvature information about the optimization landscape using spectral norm constraints, without storing or inverting large matrices.
It's conceptually related to Shampoo (in certain limits it reduces to Shampoo, as noted by @_arohan_ on Twitter), but it is much simpler to implement and cheaper to run. It uses a Newton-Schulz iteration to approximate the update direction and maintains only a first-moment state, making it even leaner than AdamW.
The result is a scalable optimizer that remains efficient at large batch sizes, while still capturing second-order correlations in the data.
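To make this concrete, the sketch below shows a Muon-style update for a single 2D weight matrix in PyTorch. It is a minimal illustration, not our exact training code: the quintic Newton-Schulz coefficients and the default hyperparameters are taken from the publicly circulated Muon reference implementation and should be read as assumptions.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix.

    Odd-polynomial Newton-Schulz iteration; the quintic coefficients below
    follow the public Muon reference implementation (an assumption here).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so the iteration is stable
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(param: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon step for a 2D weight: momentum accumulation + orthogonalized update."""
    momentum.mul_(beta).add_(grad)        # the only optimizer state: a first moment
    update = newton_schulz_orthogonalize(momentum)
    param.add_(update, alpha=-lr)
```

In public implementations Muon is applied only to the 2D weight matrices of the network; embeddings, norms, and other 1D parameters are typically still updated with AdamW.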
Key Results
1. Better Compute-Time Tradeoffs
We trained decoder-style transformer models (100M to 4B parameters) on Python code and general web data (DCLM). Across all settings, Muon reached the target losses faster, using the same number of devices as AdamW or fewer.
Figure: Muon's iso-loss curves dominate AdamW's in the compute-vs-time plane. (See Figure 2, PDF page 4.)
2. Data Efficiency at Scale
At batch sizes up to 16M tokens, Muon needed 10–15% fewer tokens than AdamW to reach the same loss. The relative advantage persists, and often grows, with batch size.
Figure: The token ratio between AdamW and Muon increases at large batch sizes. (See Figure 3, PDF page 5.)
3. muP Works with Muon
A recent tweet by @Kimi_Moonshot asked if muP can be used with Muon. We found that it can.
We used muP to transfer hyperparameters from small models to a 3.7B-parameter model (sequence length 8192), and the transfer held for both the learning rate and weight decay.
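As a point of reference, the snippet below sketches the standard muP learning-rate transfer rule for hidden (matrix-shaped) weights: tune at a small proxy width, then scale by base_width / width. Treating this 1/width factor as the right rule for Muon as well is an assumption of the sketch; the experimental claim above is simply that the values tuned on small models remained optimal at 3.7B.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Transfer a learning rate tuned on a narrow proxy model to a wider model.

    Standard muP rule for hidden weight matrices under Adam-style optimizers
    (lr scales like 1/width); applying it unchanged to Muon is an assumption.
    """
    return base_lr * base_width / width

# Hypothetical usage: a rate tuned at width 256 carried to width 4096.
lr_target = mup_hidden_lr(base_lr=0.02, base_width=256, width=4096)
```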
Tuning at Scale: The Telescoping Sweep
To avoid expensive grid searches at large scale, we used a telescoping sweep (sketched in code below):
Wide sweep at small model widths
Narrower, coarser sweeps as width doubles
Total cost grows like \(O(C \log N)\), where \(C\) is the cost of training the final model and \(N\) is its width
Figure: Refining hyperparameters across model widths. (See Figure 6, PDF page 9.)
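Here is a minimal sketch of the telescoping idea, assuming a hypothetical train_and_eval hook that trains a model at a given width and returns its validation loss; the grid contents and shrink factor are illustrative, not the exact values from our runs.

```python
import itertools

def grid_points(grid):
    """Expand {hyperparameter: candidate values} into full configurations."""
    keys = list(grid)
    for combo in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, combo))

def telescoping_sweep(widths, base_grid, train_and_eval, keep_fraction=0.5):
    """Sweep widely at the smallest width, then narrow the grid as width doubles.

    widths: increasing model widths, ending at the target width.
    base_grid: candidate values for the initial wide sweep, e.g.
               {"lr": [...], "weight_decay": [...]}.
    train_and_eval: hypothetical callable (width, hparams) -> validation loss.
    """
    grid, best = base_grid, None
    for width in widths:
        results = [(train_and_eval(width, h), h) for h in grid_points(grid)]
        best = min(results, key=lambda r: r[0])[1]
        # Keep only the candidates closest to the current winner so the search
        # telescopes: each stage is cheaper relative to the model it tunes.
        grid = {
            k: sorted(v, key=lambda x: abs(x - best[k]))[: max(1, int(len(v) * keep_fraction))]
            for k, v in grid.items()
        }
    return best
```

Because the widths double, there are roughly \(\log N\) stages, and the shrinking grids keep the total cost a small multiple of one final-model run, matching the \(O(C \log N)\) estimate above.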
Summary
Muon beats AdamW on the compute-time tradeoff across target losses and batch sizes
muP works with Muon, allowing scalable hyperparameter transfer
Tuning cost is low with telescoping sweeps
Together, they form a practical recipe for large-scale pretraining:
Muon + muP + telescoping sweep
= better performance, lower cost, same simplicity.
Resources