Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

08/01/2026 19 min

Episode Synopsis

In this episode:

• Introduction: The Alchemy of Training: Professor Norris and Linda introduce the episode, joking about the 'black art' of hyperparameter tuning before unveiling the paper of the week, 'Completed Hyperparameter Transfer' by researchers at Apple.

• Beyond Width: The Limits of muP: Linda explains the background of the Maximal Update Parametrization (muP) and why scaling only across model width isn't enough for modern LLMs, prompting skepticism from Norris about adding more complexity. (A sketch of the standard muP width-scaling rules follows this list.)

• Enter Complete(d)P: A Unified Theory: The hosts dive into the core contribution, the Complete(d)P parameterization, discussing how it fixes issues with Query-Key norms and unifies scaling across depth, batch size, and training duration using SDE principles.

• The Per-Module Revolution: Linda gets excited about the paper's boldest claim: optimizing hyperparameters specifically for different modules (such as embeddings vs. attention heads), and explains the 'jagged' optimization landscape that requires Trust Region Random Search. (A sketch of the trust-region idea also follows this list.)

• Scaling Up: 50 Million to 7 Billion: A discussion of the empirical results, focusing on how settings found on a small 50M-parameter proxy model transferred successfully to a 7B model, yielding significant training speed-ups.

• Conclusion: A Skeptic Convinced: Professor Norris admits that the rigorous math behind the SDE scaling rules is convincing, and the duo wraps up with final thoughts on what this means for the future of efficient model training.
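For listeners new to muP, the following is a minimal sketch of the standard muP width-scaling rules for Adam-trained transformers (general muP background, not code or settings from the paper): hidden-matrix learning rates shrink linearly with width, initialization standard deviations shrink with the square root of fan-in, and output logits pick up a 1/width multiplier. The function name and example values are purely illustrative.

```python
# Minimal sketch of muP-style width scaling (general muP rules for Adam,
# not code from the paper). `base_width` is the small proxy model's width;
# hyperparameters tuned there are rescaled for a wider target model.

import math

def mup_scale(base_width: int, target_width: int,
              base_lr: float, base_init_std: float):
    """Rescale Adam LR and init std for hidden weight matrices under muP."""
    ratio = base_width / target_width          # < 1 when widening
    return {
        # Hidden-matrix Adam LR shrinks linearly with width.
        "hidden_lr": base_lr * ratio,
        # Init std shrinks like 1/sqrt(fan_in), i.e. the sqrt of the ratio.
        "hidden_init_std": base_init_std * math.sqrt(ratio),
        # Output logits are scaled down by 1/width relative to standard
        # parameterization.
        "output_multiplier": ratio,
    }

# Example: transfer settings tuned at width 512 to a model of width 4096.
print(mup_scale(512, 4096, base_lr=3e-3, base_init_std=0.02))
```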
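The episode does not spell out the paper's Trust Region Random Search, so the sketch below only illustrates the general idea behind the name: random search confined to a region around the best point found so far, with the region shrinking when no candidate improves. All function names, parameters, and update rules here are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of random search inside a shrinking trust region.
# The paper's exact rules for sampling and resizing the region may differ.

import random

def trust_region_random_search(objective, center, radius=1.0,
                               samples_per_round=16, rounds=10, shrink=0.5):
    """Minimize `objective` over log-space hyperparameters near `center`.

    `center` maps hyperparameter names to log-scale values; candidates are
    drawn uniformly within `radius` of the current best point.
    """
    best, best_val = dict(center), objective(center)
    for _ in range(rounds):
        improved = False
        for _ in range(samples_per_round):
            cand = {k: v + random.uniform(-radius, radius)
                    for k, v in best.items()}
            val = objective(cand)
            if val < best_val:             # keep the candidate, keep the radius
                best, best_val, improved = cand, val, True
        if not improved:                   # no progress: tighten the region
            radius *= shrink
    return best, best_val

# Example: tune per-module log10 learning rates against a toy proxy loss.
loss = lambda hp: (hp["log_lr_embed"] + 2.5) ** 2 + (hp["log_lr_attn"] + 3.0) ** 2
print(trust_region_random_search(loss, {"log_lr_embed": -2.0, "log_lr_attn": -2.0}))
```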