The Muon optimizer has been shown to be efficient, potentially outpacing AdamW for LLM training. To quote Essential AI: “Muon requires 10–15% fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”
We'd love to accept a contribution of a canonical example of Muon in the torchtune library, specifically for our full SFT recipes (single device and multi-GPU).
Artifacts
An implementation of the Muon optimizer as a PyTorch Optimizer (see the sketch after this list for one possible shape)
Any changes needed to the recipes to support Muon across our feature set
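For orientation, here is a minimal sketch of what such an optimizer might look like, loosely based on Keller Jordan's public reference implementation of Muon. The function names, hyperparameter defaults, and scaling heuristic below are illustrative rather than an existing torchtune API; a contributed version would also need to handle distributed sharding, mixed precision, and the non-matrix parameters that Muon delegates to AdamW.

```python
# Minimal sketch of Muon as a torch.optim.Optimizer, loosely following the
# reference implementation by Keller Jordan (github.com/KellerJordan/Muon).
# Defaults and the scaling heuristic are illustrative, not a torchtune API.
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration.

    The reference implementation runs this in bfloat16 for speed; the input
    dtype is kept here to keep the sketch simple.
    """
    a, b, c = (3.4445, -4.7750, 2.0315)
    x = g
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    x = x / (x.norm() + 1e-7)  # bound the spectral norm so the iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


class Muon(torch.optim.Optimizer):
    """Momentum whose update direction is orthogonalized via Newton-Schulz.

    Intended only for 2D weight matrices; embeddings, norms, and biases are
    normally left to AdamW in a separate optimizer or param group.
    """

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(g)
                g = g.add(buf, alpha=group["momentum"]) if group["nesterov"] else buf
                # Orthogonalize the update direction (flattening any trailing dims).
                update = newton_schulz_orthogonalize(
                    g.reshape(g.size(0), -1), steps=group["ns_steps"]
                ).reshape_as(g)
                # One common heuristic to keep the update RMS comparable to AdamW's.
                scale = max(1.0, p.size(0) / p[0].numel()) ** 0.5
                p.add_(update, alpha=-group["lr"] * scale)
        return loss
```

In a recipe, matrix-shaped weights would be handed to this optimizer while embeddings, output projections, norms, and biases stay on AdamW, mirroring the split used in the reference setup.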
Acceptance Criteria
Clean, well-documented code with proper citations
Tests
Logs comparing Muon to AdamW for text training
Logs comparing Muon to AdamW for multimodal (image + text) training