Gemini's Multimodality

02/07/2025 44 min Episode 10

Episode Synopsis

Ani Baddepudi, Gemini Model Behavior Product Lead, joins host Logan Kilpatrick for a deep dive into Gemini's multimodal capabilities. Their conversation explores why Gemini was built as a natively multimodal model from day one, the future of proactive AI assistants, and how we are moving towards a world where "everything is vision." Learn about the differences between video and image understanding, token representations, higher-FPS video sampling, and more.

Chapters:

0:00 - Intro
1:12 - Why Gemini is natively multimodal
2:23 - The technology behind multimodal models
5:15 - Video understanding with Gemini 2.5
9:25 - Deciding what to build next
13:23 - Building new product experiences with multimodal AI
17:15 - The vision for proactive assistants
24:13 - Improving video usability with variable FPS and frame tokenization
27:35 - What's next for Gemini's multimodal development
31:47 - Deep dive on Gemini's document understanding capabilities
37:56 - The teamwork and collaboration behind Gemini
40:56 - What's next with model behavior

Watch on YouTube: https://www.youtube.com/watch?v=K4vXvaRV0dw