Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

08/01/2026 19 min

Episode Synopsis

In this episode:

• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week, 'Completed Hyperparameter Transfer' by researchers at Apple.

• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs, prompting skepticism from Norris about adding more complexity. (A sketch of the standard muP width-scaling rules follows this list.)

• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution, the Complete(d)P parameterization, discussing how it fixes issues with Query-Key norms and unifies scaling across depth, batch size, and training duration using SDE principles.

• The Per-Module Revolution: Linda gets excited about the paper's boldest claim: optimizing hyperparameters specifically for different modules (such as embeddings vs. attention heads), and explains the 'jagged' optimization landscape that requires Trust Region Random Search. (A sketch of the trust-region idea also follows this list.)

• Scaling Up: 50 Million to 7 Billion: A discussion of the empirical results, focusing on how settings found on a small 50M-parameter proxy model transferred successfully to a 7B model, yielding significant training speed-ups.

• Conclusion: A Skeptic Convinced: Professor Norris admits that the rigorous math behind the SDE scaling rules is convincing, and the duo wraps up with final thoughts on what this means for the future of efficient model training.
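For listeners new to muP, the following is a minimal sketch of the standard muP width-scaling rules for Adam-trained transformers (general muP background, not code or settings from the paper): hidden-matrix learning rates shrink linearly with width, initialization standard deviations shrink with the square root of fan-in, and output logits pick up a 1/width multiplier. The function name and example values are purely illustrative.

```python
# Minimal sketch of muP-style width scaling (general muP rules for Adam,
# not code from the paper). `base_width` is the small proxy model's width;
# hyperparameters tuned there are rescaled for a wider target model.

import math

def mup_scale(base_width: int, target_width: int,
              base_lr: float, base_init_std: float):
    """Rescale Adam LR and init std for hidden weight matrices under muP."""
    ratio = base_width / target_width          # < 1 when widening
    return {
        # Hidden-matrix Adam LR shrinks linearly with width.
        "hidden_lr": base_lr * ratio,
        # Init std shrinks like 1/sqrt(fan_in), i.e. the sqrt of the ratio.
        "hidden_init_std": base_init_std * math.sqrt(ratio),
        # Output logits are scaled down by 1/width relative to standard
        # parameterization.
        "output_multiplier": ratio,
    }

# Example: transfer settings tuned at width 512 to a model of width 4096.
print(mup_scale(512, 4096, base_lr=3e-3, base_init_std=0.02))
```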
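The episode does not spell out the paper's Trust Region Random Search, so the sketch below only illustrates the general idea behind the name: random search confined to a region around the best point found so far, with the region shrinking when no candidate improves. All function names, parameters, and update rules here are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of random search inside a shrinking trust region.
# The paper's exact rules for sampling and resizing the region may differ.

import random

def trust_region_random_search(objective, center, radius=1.0,
                               samples_per_round=16, rounds=10, shrink=0.5):
    """Minimize `objective` over log-space hyperparameters near `center`.

    `center` maps hyperparameter names to log-scale values; candidates are
    drawn uniformly within `radius` of the current best point.
    """
    best, best_val = dict(center), objective(center)
    for _ in range(rounds):
        improved = False
        for _ in range(samples_per_round):
            cand = {k: v + random.uniform(-radius, radius)
                    for k, v in best.items()}
            val = objective(cand)
            if val < best_val:             # keep the candidate, keep the radius
                best, best_val, improved = cand, val, True
        if not improved:                   # no progress: tighten the region
            radius *= shrink
    return best, best_val

# Example: tune per-module log10 learning rates against a toy proxy loss.
loss = lambda hp: (hp["log_lr_embed"] + 2.5) ** 2 + (hp["log_lr_attn"] + 3.0) ** 2
print(trust_region_random_search(loss, {"log_lr_embed": -2.0, "log_lr_attn": -2.0}))
```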