o3 - wow
Episode Synopsis
o3 isn’t one of the biggest developments in AI for 2+ years because it beats a particular benchmark. It is so because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.

FrontierMath: https://epoch.ai/frontiermath https://arxiv.org/pdf/2411.04872
Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough
MLC Paper: https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social
AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1
Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614
Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/
Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893
Swe-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/
Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518
David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638
OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/
https://simple-bench.com/
John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725

00:00 - Introduction
01:19 - What is o3?
03:18 - FrontierMath
05:15 - o4, o5
06:03 - GPQA
06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2
08:13 - 1st Caveat
09:03 - Compositionality?
10:16 - SimpleBench?
13:11 - ARC-AGI, Chollet