EP8: Training Models at Scale | AWS for AI Podcast

02/09/2025 1h 4min Episode 8

Episode Synopsis


Join us for an enlightening conversation with Anton Alexander, AWS's Senior Specialist for Worldwide Foundation Models, as we delve into the complexities of training and scaling large foundation models. Anton brings his unique expertise from working with the world's top model builders, along with his fascinating journey from Trinidad and Tobago to becoming a leading AI infrastructure expert.

Discover practical insights on managing massive GPU clusters, optimizing distributed training, and handling the critical challenges of model development at scale. Learn about cutting-edge solutions in GPU failure detection, checkpointing strategies, and the evolution of inference workloads. Get an insider's perspective on emerging trends like GRPO, visual LLMs, and the future of AI model development.

Don't miss this technical deep dive where we explore real-world solutions for building and deploying foundation AI models, featuring discussions on everything from low-level infrastructure optimization to high-level AI development strategies.

Learn more: http://go.aws/47yubYq
Amazon SageMaker HyperPod: https://aws.amazon.com/fr/sagemaker/ai/hyperpod/
The Llama 3 Herd of Models paper: https://arxiv.org/abs/2407.21783

Chapters:
00:00:00 : Introduction and Guest Background
00:01:18 : Anton's Journey from the Caribbean to AI
00:05:52 : Mathematics in AI
00:07:20 : Large Model Training Challenges
00:09:54 : GPU Failures: The Llama Herd of Models
00:13:40 : Grey Failures
00:15:05 : Model Training Trends
00:17:40 : Managing Mixture of Experts Models
00:21:50 : Estimating How Many GPUs You Need
00:25:12 : Monitoring the Loss Function
00:27:08 : Crashing Trainings
00:28:10 : The SageMaker HyperPod Story
00:32:15 : How We Automate Managing Grey Failures
00:37:28 : Which Metrics to Optimize For
00:40:23 : Checkpointing Strategies
00:44:48 : USE: Utilization, Saturation, Errors
00:50:11 : SageMaker HyperPod for Inferencing
00:54:58 : Resiliency in Training vs. Inferencing Workloads
00:56:44 : NVIDIA NeMo Ecosystem and Agents
00:59:49 : Future Trends in AI
01:03:17 : Closing Thoughts
