Muon, muP, and the Compute-Time Tradeoff

May 2, 2025

Recent posts by Jeremy Bernstein and Keller Jordan, along with a tweet from Moonshot, have built a lot of excitement around Muon, a lightweight second-order optimizer for training large language models. In light of these posts, two key questions come up:

  1. Is Muon a robust replacement for AdamW?

  2. Does Muon work well with muP, the maximal update parameterization?

We answer both of these questions in the affirmative. In this post, we share new experimental results showing that Muon not only works with muP; it delivers better compute-time tradeoffs than AdamW, especially at large batch sizes. When paired with a lightweight hyperparameter tuning strategy, it becomes a practical alternative at scale.

Why This Matters

The true cost of training isn't just steps or FLOPs; it's how much time and compute it takes to reach your target loss. If an optimizer can hit the same loss with fewer tokens or in less wall-clock time, that directly cuts deployment timelines and cloud costs.

Muon helps on both fronts.

What Muon Does Differently

Muon is a lightweight second-order optimizer. It captures curvature information about the optimization landscape through spectral-norm constraints on its updates, without storing or inverting large matrices.

It's conceptually related to Shampoo (and in certain limits reduces to it, as noted by @_arohan_ on Twitter), but it is much simpler to implement and cheaper to run. It uses a Newton-Schulz iteration to approximate the update direction and maintains only a first-moment (momentum) state, making it even leaner than AdamW.
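To make the update concrete, here is a minimal sketch of a Muon-style step in PyTorch. It loosely follows Keller Jordan's public reference implementation, but the function names and the simplifications (no Nesterov momentum, no bfloat16 math, no parameter grouping) are ours, so treat it as an illustration rather than the exact code behind our experiments.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a 2D matrix to the nearest semi-orthogonal matrix
    (U V^T from its SVD) with an odd polynomial iteration; the quintic
    coefficients come from the public Muon reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)           # scale so the spectral norm is <= 1 and the iteration converges
    transpose = X.size(0) > X.size(1)  # iterate in the "wide" orientation for efficiency
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update: accumulate plain momentum (the only optimizer
    state), orthogonalize it, and step in that direction."""
    momentum.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```

In the usual setup, updates like this are applied to the 2D hidden weight matrices, while embeddings, norms, and other non-matrix parameters are still trained with an AdamW-style rule.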

The result is a scalable optimizer that remains efficient at large batch sizes, while still capturing second-order correlations in the data.

Key Results

1. Better Compute-Time Tradeoffs

We trained decoder-only transformer models (100M to 4B parameters) on Python code and general web data (DCLM). Across all settings, Muon reached target losses faster and with the same number of devices as AdamW or fewer.

πŸ“Š Figure: Muon's iso-loss curves dominate AdamW in the compute-vs-time plane. (See: Figure 2, PDF page 4)

2. Data Efficiency at Scale

At batch sizes up to 16M tokens, Muon needed 10–15% fewer tokens than AdamW to reach the same loss. The relative advantage persists and often grows with batch size.

πŸ“Š Figure: Token ratio between AdamW and Muon increases at large B. (See: Figure 3, PDF page 5)

3. muP Works with Muon

A recent tweet by @Kimi_Moonshot asked if muP can be used with Muon. We found that it can.

We used muP to transfer hyperparameters from small models to a 3.7B model (sequence length 8192), and the transfer held for both learning rate and weight decay.
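For readers who have not used muP before, the snippet below illustrates the basic width-scaling idea behind this kind of transfer, using the standard Adam-style rule in which hidden-weight learning rates shrink in proportion to width. The names (`mup_lr`, `base_width`) are made up for illustration, and this generic rule is not necessarily the exact parameterization we used with Muon.

```python
def mup_lr(base_lr: float, base_width: int, width: int, is_hidden_matrix: bool) -> float:
    """Standard muP-style learning-rate transfer (Adam variant): hidden weight
    matrices scale their learning rate by base_width / width, so a value tuned
    on a small proxy model carries over to a wider one; vector-like parameters
    (biases, norms) and input embeddings keep the base rate."""
    return base_lr * base_width / width if is_hidden_matrix else base_lr

# A rate tuned at width 256 transfers to width 4096 by scaling down 16x:
print(mup_lr(3e-3, base_width=256, width=4096, is_hidden_matrix=True))   # 0.0001875
print(mup_lr(3e-3, base_width=256, width=4096, is_hidden_matrix=False))  # 0.003
```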

Tuning at Scale: The Telescoping Sweep

To avoid expensive grid searches at large scale, we used a telescoping sweep (a toy version is sketched in code after the list):

  • Wide sweep at small model widths

  • Narrower, coarser sweeps around the previous optimum as width doubles

  • Total cost grows like \(O(C \log N)\), where \(C\) is the cost of training the final model and the \(\log N\) factor reflects the number of width doublings
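A toy version of the procedure, for a single hyperparameter and with hypothetical names (`train_and_eval`, `lr_grid`), might look like this; the real sweep covers more hyperparameters and more careful bracketing, but the cost structure is the same.

```python
def telescoping_sweep(train_and_eval, base_width=256, final_width=4096,
                      lr_grid=(1e-3, 3e-3, 1e-2, 3e-2, 1e-1)):
    """Toy telescoping sweep: run the full grid at the smallest width, then at
    each width doubling only re-check the immediate neighbors of the previous
    best value, so the number of runs per stage stays constant and the total
    cost stays within a log factor of one final-model training run."""
    grid = list(lr_grid)
    width = base_width
    best_lr = min(grid, key=lambda lr: train_and_eval(width, lr))  # wide sweep at small width
    while width < final_width:
        width *= 2
        i = grid.index(best_lr)
        candidates = grid[max(0, i - 1): i + 2]                    # narrow window around previous optimum
        best_lr = min(candidates, key=lambda lr: train_and_eval(width, lr))
    return best_lr

# Hypothetical usage: train_and_eval(width, lr) trains a proxy model at the
# given width and learning rate and returns its validation loss.
```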

πŸ“Š Figure: Refining hyperparameters across model widths (See: Figure 6, PDF page 9)

Summary

  • Muon beats AdamW on the compute-time tradeoff across target losses and batch sizes

  • muP works with Muon, allowing scalable hyperparameter transfer

  • Tuning cost is low with telescoping sweeps

Together, they form a practical recipe for large-scale pretraining:

Muon + muP + telescoping sweep
= better performance, lower cost, same simplicity.

Resources