Listen "Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration"
Episode Synopsis
In this episode:
• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week: 'Completed Hyperparameter Transfer' by researchers at Apple.
• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs, prompting skepticism from Norris about adding more complexity (see the sketch after this list).
• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution: the Complete(d)P parameterization, discussing how it fixes issues with Query-Key norms and unifies scaling across depth, batch size, and training duration using SDE principles.
• The Per-Module Revolution: Linda gets excited about the paper's boldest claim: optimizing hyperparameters specifically for different modules (like embeddings vs. attention heads), and explains the 'jagged' optimization landscape that requires Trust Region Random Search.
• Scaling Up: 50 Million to 7 Billion: Discussion of the empirical results, focusing on how settings found on a small 50M-parameter proxy model successfully transferred to a 7B model, resulting in significant training speed-ups.
• Conclusion: A Skeptic Convinced: Professor Norris admits that the rigorous math behind the SDE scaling rules is convincing, and the duo wraps up with final thoughts on what this means for the future of efficient model training.
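For context on the width-only transfer the hosts mention, here is a minimal sketch of the classic muP-style rules for rescaling per-module Adam learning rates from a small proxy width to a larger target width. The module grouping and names (base_lrs, base_width, target_width, 'embedding'/'hidden'/'readout') are illustrative assumptions, and this does not reproduce the paper's Complete(d)P rules for depth, batch size, training duration, or per-module search.

    # Minimal sketch (not the paper's Complete(d)P): classic muP width transfer
    # for Adam, where matrix-like and readout learning rates shrink as
    # 1/width-multiplier while embedding-like parameters keep theirs.

    def mup_width_scaled_lrs(base_lrs, base_width, target_width):
        # base_lrs: per-module learning rates tuned on the small proxy model;
        # the keys 'embedding', 'hidden', 'readout' are assumed names here.
        m = target_width / base_width  # width multiplier
        return {
            "embedding": base_lrs["embedding"],    # unchanged under muP with Adam
            "hidden": base_lrs["hidden"] / m,      # hidden matrix params: lr / m
            "readout": base_lrs["readout"] / m,    # unembedding/readout: lr / m
        }

    # Example: carry settings found on a ~50M-parameter proxy (width 512)
    # over to a wider target model (width 4096).
    proxy_lrs = {"embedding": 3e-3, "hidden": 3e-3, "readout": 3e-3}
    print(mup_width_scaled_lrs(proxy_lrs, base_width=512, target_width=4096))

The episode's point is that rules like these only cover width; Complete(d)P extends the same transfer idea to depth, batch size, training duration, and per-module hyperparameters.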
More episodes of the podcast Mechanical Dreams
• Engram Paper (12/01/2026)
• From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (09/01/2026)
• Dion: Distributed Orthonormalized Updates (06/01/2026)
• Latent State Models of Training Dynamics (28/10/2025)
• DeepSeek OCR (24/10/2025)
• Untitled Episode (10/10/2025)