The Muon optimizer has been shown to be efficient, potentially outpacing AdamW for LLM training. To quote Essential AI: “Muon requires 10–15% fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”
We'd love to accept a contribution of a canonical example of Muon in the torchtune library, specifically for our full SFT recipes (single device and multi-GPU).
Artifacts
An implementation of the Muon optimizer as a PyTorch Optimizer (see the sketch after this list for one possible shape)
Any changes needed to the recipes to support Muon across our feature set
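For orientation, here is a minimal sketch of what such an optimizer might look like, loosely based on Keller Jordan's public reference implementation of Muon. The function names, hyperparameter defaults, and scaling heuristic below are illustrative rather than an existing torchtune API; a contributed version would also need to handle distributed sharding, mixed precision, and the non-matrix parameters that Muon delegates to AdamW.

```python
# Minimal sketch of Muon as a torch.optim.Optimizer, loosely following the
# reference implementation by Keller Jordan (github.com/KellerJordan/Muon).
# Defaults and the scaling heuristic are illustrative, not a torchtune API.
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration.

    The reference implementation runs this in bfloat16 for speed; the input
    dtype is kept here to keep the sketch simple.
    """
    a, b, c = (3.4445, -4.7750, 2.0315)
    x = g
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    x = x / (x.norm() + 1e-7)  # bound the spectral norm so the iteration converges
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


class Muon(torch.optim.Optimizer):
    """Momentum whose update direction is orthogonalized via Newton-Schulz.

    Intended only for 2D weight matrices; embeddings, norms, and biases are
    normally left to AdamW in a separate optimizer or param group.
    """

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True, ns_steps=5):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov, ns_steps=ns_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(g)
                buf = state["momentum_buffer"]
                buf.mul_(group["momentum"]).add_(g)
                g = g.add(buf, alpha=group["momentum"]) if group["nesterov"] else buf
                # Orthogonalize the update direction (flattening any trailing dims).
                update = newton_schulz_orthogonalize(
                    g.reshape(g.size(0), -1), steps=group["ns_steps"]
                ).reshape_as(g)
                # One common heuristic to keep the update RMS comparable to AdamW's.
                scale = max(1.0, p.size(0) / p[0].numel()) ** 0.5
                p.add_(update, alpha=-group["lr"] * scale)
        return loss
```

In a recipe, matrix-shaped weights would be handed to this optimizer while embeddings, output projections, norms, and biases stay on AdamW, mirroring the split used in the reference setup.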
Acceptance Criteria
Clean, well-documented code with proper citations
Tests
Logs comparing Muon to AdamW for text training
Logs comparing Muon to AdamW for multimodal (image + text) training