Muon, muP, and the Compute‑Time Tradeoff

Essential AI • Optimization

May 12, 2025

Adam and AdamW are vital ingredients in the pretraining recipe, dominating neural network optimization. Recently, Muon — a surprisingly simple second‑order optimizer — has emerged as a potential alternative. In our paper, we ask:

  1. Is Muon a robust replacement for AdamW?
  2. Does Muon work well with muP (maximal update parameterization)?

We demonstrate that Muon [Bernstein, Keller, Moonshot] achieves better compute‑time tradeoffs than AdamW, especially at large batch sizes. It pairs naturally with muP, enabling lightweight hyperparameter transfer and easy efficiency wins for pretraining LLMs.


Why This Matters

For pretraining, what matters is the time and compute to reach your target loss given the hardware you have. All else equal, you want an optimizer that reaches the same loss with fewer tokens, or in less wall‑clock time.

We find that Muon accomplishes both.


What Muon Does Differently

Muon is a lightweight second‑order optimizer that can be viewed as a special case of Shampoo under certain assumptions [shampoo‑reduction]. It approximates second‑order information without storing or inverting large matrices: a Newton‑Schulz iteration approximately orthogonalizes each weight matrix's update, and only a first‑moment state is kept. That makes it even leaner than AdamW, and it scales well at large batch sizes.
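To make the shape of the update concrete, here is a minimal PyTorch sketch of a Muon‑style step for a single 2‑D weight matrix. The quintic Newton‑Schulz coefficients and the shape‑dependent scale factor follow commonly circulated reference implementations and should be read as assumptions, not as the exact recipe used in our runs.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize M with a quintic Newton-Schulz iteration."""
    # Coefficients taken from widely used Muon reference code (assumption).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)              # normalize so the iteration is stable
    transposed = X.size(0) > X.size(1)
    if transposed:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One in-place Muon-style update for a 2-D weight matrix.

    Only a first-moment buffer is kept; the raw momentum direction is
    replaced by its approximate orthogonalization.
    """
    momentum.mul_(beta).add_(grad)                    # first moment only
    update = newton_schulz_orthogonalize(momentum)
    # Shape-dependent scaling keeps update magnitudes roughly consistent
    # across layer shapes; exact conventions vary between implementations.
    scale = max(1.0, W.size(0) / W.size(1)) ** 0.5
    W.add_(update, alpha=-lr * scale)
```

In a full optimizer this step is applied per matrix‑shaped parameter; embeddings, norms, and biases typically fall back to an AdamW‑style update.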


Key Results

1. Better Compute‑Time Tradeoffs

We trained decoder‑style transformer models (100M to 4B parameters) on Python code and general web data (DCLM). Across all settings, Muon reached target losses faster and with fewer tokens than AdamW.


2. Data Efficiency at Scale

At batch sizes up to 16M tokens, Muon needed 10–15% fewer tokens than AdamW to reach the same loss. The relative advantage persists — and often grows — with batch size.


3. muP Works with Muon

We used muP to transfer hyperparameters from small models to a 3.7B‑parameter model (sequence length 8192), and the transfer held for both learning rate and weight decay. A “telescoping” sweep narrows the search space as width increases, keeping large‑model sweeps tractable; a small sketch of the idea follows.
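As an illustration of the telescoping idea (not the exact protocol from the paper), the sketch below narrows a log‑spaced learning‑rate grid around the previous width's best value as model width grows; `train_to_loss` is a hypothetical callback that trains a short proxy run and returns validation loss.

```python
import numpy as np

def telescoping_lr_sweep(widths, train_to_loss, init_grid, shrink=0.5, points=3):
    """Sweep learning rate across increasing model widths, shrinking the
    (log-scale) search window around the best value found so far.

    `train_to_loss(width, lr)` is a hypothetical callback returning
    validation loss for a short proxy run at the given width.
    """
    grid = np.asarray(init_grid, dtype=float)
    best_lr = None
    for width in widths:
        losses = {lr: train_to_loss(width, lr) for lr in grid}
        best_lr = min(losses, key=losses.get)
        # Narrow the search window: with muP, the optimum should move little
        # with width, so sweeps at larger widths can afford fewer points.
        log_span = (np.log(grid.max()) - np.log(grid.min())) * shrink
        grid = np.exp(np.linspace(np.log(best_lr) - log_span / 2,
                                  np.log(best_lr) + log_span / 2, points))
    return best_lr

# e.g. telescoping_lr_sweep([256, 512, 1024], train_to_loss, np.logspace(-3, -1, 5))
```

Because muP keeps the optimum roughly stable across widths, the learning rate selected at the final proxy width can then be reused for the full‑scale run.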


Summary

  • Muon beats AdamW in compute‑time tradeoff across losses and batch sizes.
  • muP works with Muon, enabling scalable hyperparameter transfer. Our telescoping sweep makes this practical.

Together, they form a practical recipe for large‑scale pretraining.


Resources