Listen "Advanced LLM Optimization techniques"
Episode Synopsis
Welcome to another Data Architecture Elevator podcast! Today's discussion is hosted by Paolo Platter, supported by our experts Antonino Ingargiola and Irene Donato.
In this episode, we explore effective strategies for optimizing large language models (LLMs) for inference tasks with multimodal data like audio, text, images, and video.
We discuss the shift from online APIs to hosted models, choosing smaller, task-specific models, and leveraging fine-tuning, distillation, quantization, and tensor fusion techniques. We also highlight the role of specialized inference servers such as Triton and Dynamo, and how Kubernetes helps manage horizontal scaling.
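For listeners who want to try one of these ideas hands-on, here is a minimal sketch of loading an LLM with 4-bit weight quantization for inference. It assumes the Hugging Face transformers, accelerate, and bitsandbytes libraries are installed, and the model name is only an illustrative placeholder, not one discussed in the episode.

# Minimal sketch: 4-bit quantized inference with Hugging Face transformers.
# Assumes transformers, accelerate, and bitsandbytes are installed; the model
# name is an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, swap in your own

# Quantize weights to 4 bits at load time to cut GPU memory and serving cost.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs automatically
)

# Quick end-to-end check of the quantized model.
inputs = tokenizer("Summarize the benefits of model quantization:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))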
Don't forget to follow us on LinkedIn! Enjoy!
More episodes of the podcast Data Architecture Elevator
Agents vs Tools: Spot the differences! (03/03/2025)
Espresso - WASM and UDF (20/02/2025)
Agentic AI (12/02/2025)
Data Privacy and Crypto-Shredding (17/12/2024)
Espresso - Data Science and Data Engineering (12/12/2024)
Espresso - MLOps (03/12/2024)
Data Contracts (22/11/2024)
Espresso - FinOps (14/11/2024)