Muon Doesn't Clearly Grok Faster
June 20, 2025
After a decade of dominance, we're seeing promising second-order alternatives to Adam emerge. Among them, Muon has the right balance of simplicity and performance. In our prior work [muon], we showed that Muon achieves better compute-time tradeoffs during pre-training. Motivated by recent claims that Muon accelerates grokking [Tveit et al.], we explore grokking as a potential testbed to better understand how different optimizers affect learning dynamics.
Based on our results, we find grokking to be an insufficient testbed for disentangling the learning dynamics of the optimizers we study. As [Tveit et al.] observe, Muon does grok faster than AdamW under certain conditions; however, when we explore a wider range of hyperparameters and model sizes, this advantage disappears. The onset and duration of grokking were highly sensitive to factors such as batch size and embedding dimension, which made it difficult to isolate optimizer-specific dynamics.
While we only offer empirical evidence, we hope our findings help the community understand grokking better and encourage further work in disentangling the learning behaviors of different optimizers.
Goal of the study
Grokking is the phenomenon where models achieve perfect training accuracy early but continue to perform poorly on the test set, only to generalize after prolonged overfitting.
We aimed to explore:
- Does Muon achieve better token efficiency than AdamW in an algorithmic grokking task?
- How does gradient update rank affect this? (One way to quantify update rank is sketched after this list.)
- How do hyperparameters like embedding dimension and batch size affect tradeoffs?
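For the second question, one plausible way to quantify the rank of an update is the entropy-based effective rank of each 2-D weight-update matrix. The study itself may track a different statistic, so treat this as an illustrative sketch rather than our exact measurement:

```python
import numpy as np


def effective_rank(update, eps=1e-12):
    """Effective rank of a 2-D weight update: exp of the entropy of the
    normalized singular-value distribution. Close to the matrix rank when
    singular values are uniform, much smaller when a few dominate."""
    s = np.linalg.svd(np.asarray(update), compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))
```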
Our approach
We evaluate on a modular division dataset (base 97) with a 50/50 train/test split [Power et al.]. We define the grokking step as the first training step at which validation accuracy comes within 1% of its maximum, provided validation accuracy exceeds 95% at some point in training.
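For concreteness, here is a minimal sketch of this setup. The function names, the random seed, and the exact split logic are ours rather than the original implementation; the task itself follows Power et al.

```python
import numpy as np

P = 97  # modulus for the modular division task


def build_modular_division_dataset(p=P, train_frac=0.5, seed=0):
    """Enumerate all pairs (a, b) with b != 0, labeled with a / b mod p,
    then split them 50/50 into train and test sets."""
    pairs, labels = [], []
    for a in range(p):
        for b in range(1, p):
            # a / b mod p == a * b^(p-2) mod p, since p is prime (Fermat's little theorem)
            labels.append((a * pow(b, p - 2, p)) % p)
            pairs.append((a, b))
    pairs, labels = np.array(pairs), np.array(labels)
    idx = np.random.default_rng(seed).permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = idx[:n_train], idx[n_train:]
    return (pairs[train], labels[train]), (pairs[test], labels[test])


def grokking_step(val_acc, tol=0.01, threshold=0.95):
    """First step at which validation accuracy comes within `tol` (taken here
    as an absolute 0.01) of its maximum, provided accuracy ever exceeds `threshold`."""
    val_acc = np.asarray(val_acc)
    if val_acc.max() < threshold:
        return None  # the run never grokked
    return int(np.argmax(val_acc >= val_acc.max() - tol))
```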
Measuring grokking start
We define grokking start as the earlier of:
- First-order gradient peak
- Second-order gradient peak
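As a rough sketch, and assuming both signals are logged as per-step scalar norms (the exact statistics tracked are not pinned down here), grokking start could be computed as:

```python
import numpy as np


def grokking_start(first_order_trace, second_order_trace):
    """Earlier of the two peak steps.

    first_order_trace  : per-step norm of the raw gradient.
    second_order_trace : per-step norm of a second-order statistic
                         (e.g. the optimizer's squared-gradient estimate);
                         which statistic is tracked is an assumption here.
    """
    first_peak = int(np.argmax(first_order_trace))
    second_peak = int(np.argmax(second_order_trace))
    return min(first_peak, second_peak)
```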
What we found
1. Larger embedding dimension → faster grokking.
2. Larger batch size → slower grokking.
3. Muon didn't consistently beat AdamW; results varied with hyperparameters.
Conclusion
Embedding dimension and batch size strongly influence grokking. No clear optimizer winner emerged across all settings; optimizer effectiveness depends heavily on context.