Dion: Distributed Orthonormalized Updates

06/01/2026 18 min

Episode Synopsis

In this episode:

• The GPU Bill Blues: Professor Norris laments the exorbitant cost of training large models, setting the stage for Linda to introduce the episode's focus: 'Dion: Distributed Orthonormalized Updates' by researchers from Microsoft and Harvard.
• Muon's Heavy Lifting: Linda explains the predecessor, the Muon optimizer, and the benefits of its orthonormalized updates. Norris questions why a new method is needed, leading to a discussion of how Newton-Schulz iterations become a communication bottleneck in sharded distributed training.
• Rethinking Linear Algebra: Linda details Dion's core innovation: replacing full matrix reconstruction with amortized power iteration on a momentum buffer (see the sketch after this list). Norris is skeptical about the math, but Linda explains how this integrates cleanly with weight sharding.
• The Magic of Error Feedback: The hosts discuss the 'rank-fraction' parameter and how low-rank updates save compute. Linda explains the crucial role of 'error feedback' in maintaining accuracy, finally winning Norris over.
• Lazy Updates and CPU Offloading: A look at Dion's algorithmic flexibility, including the 'Lazy-Dion' and CPU-offloading variants. They discuss experimental results showing Dion matching Muon's performance with significantly lower wall-clock time.
• Future-Proofing Optimization: Professor Norris concedes the elegance of the solution. The pair wraps up with thoughts on how Dion might become the standard for training next-generation foundation models.
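To make the ideas discussed in the episode concrete, here is a minimal single-device sketch of a Dion-style step: one amortized power-iteration pass against the momentum buffer, a low-rank orthonormalized update, and error feedback that keeps the unexplained residual in the buffer. The function name, the decay constant `mu`, the column normalization, and the `sqrt(m/n)` scaling are illustrative assumptions rather than the paper's reference implementation, and the distributed sharding that motivates Dion is omitted.

```python
import torch

def dion_style_step(X, G, M, Q, lr=0.01, mu=0.95):
    """Hedged sketch of a Dion-style orthonormalized update for one 2-D layer.

    X: (m, n) weight matrix          G: gradient, same shape as X
    M: persistent momentum buffer    Q: persistent (n, r) right factor,
                                        reused across steps (amortized power iteration)
    """
    B = M + G                                   # buffer seen by this step

    # One power-iteration pass, warm-started from the previous step's Q.
    P, _ = torch.linalg.qr(B @ Q)               # (m, r) column-orthonormal left factor
    R = B.T @ P                                 # (n, r) right factor

    # Error feedback: the buffer keeps everything the rank-r factors missed,
    # plus a mu-weighted share of what they did capture.
    M.copy_(B - (1.0 - mu) * (P @ R.T))

    # Column-normalize R to serve as the right factor of the update and
    # as the warm start for the next step.
    Q_new = R / (R.norm(dim=0, keepdim=True) + 1e-8)
    Q.copy_(Q_new)

    # Orthonormalized low-rank update with a Muon-like sqrt(m/n) shape scaling.
    m, n = X.shape
    X.sub_(lr * (m / n) ** 0.5 * (P @ Q_new.T))


# Illustrative usage with the rank set by a rank fraction of 1/16:
m, n = 1024, 4096
r = n // 16
X = torch.randn(m, n)
M = torch.zeros(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))       # random orthonormal warm start
dion_style_step(X, torch.randn(m, n), M, Q)     # one step on a dummy gradient
```

Because only the thin factors P and Q ever need to be exchanged or orthonormalized, a sketch like this hints at why the approach avoids the full Newton-Schulz reconstruction that becomes a communication bottleneck under weight sharding.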