Listen "ChatGPT 5.1 Codex Max"
Episode Synopsis
OpenAI has given us GPT-5.1-Codex-Max, their best coding model for OpenAI Codex.
They claim it is faster, more capable and token-efficient and has better persistence on long tasks.
It scores 77.9% on SWE-bench-verified, 79.9% on SWE-Lancer-IC SWE and 58.1% on Terminal-Bench 2.0, all substantial gains over GPT-5.1-Codex.
It's triggering OpenAI to prepare for being high level in cybersecurity threats.
There's a 27 page system card. One could call this the secret ‘real’ GPT-5.1 that matters.
They even finally trained it to use Windows, somehow this is a new idea.
My goal is for my review of Opus 4.5 to start on Friday, as it takes a few days to sort through new releases. This post was written before Anthropic revealed Opus 4.5, and we don’t yet know how big an upgrade Opus 4.5 will prove to be. As always, try all your various options and choose what is best for you.
The Famous METR Graph
GPT-5.1-Codex-Max is a new high on the METR graph. METR's thread is here.
Prinz: METR (50% accuracy):
GPT-5.1-Codex-Max = 2 hours, 42 minutes
This is 25 minutes longer than GPT-5.
Samuel Albanie [...] ---Outline:(01:18) The Famous METR Graph(02:46) The System Card(03:43) Basic Disallowed Content(04:17) Sandbox(05:34) Mitigations For Harmful Tasks and Prompt Injections(06:13) Preparedness Framework(06:35) Biological and Chemical(07:50) Cybersecurity(11:58) AI Self-Improvement(14:27) Reactions ---
First published:
November 25th, 2025
Source:
https://www.lesswrong.com/posts/YMFYQpsY2MGbXKPtS/chatgpt-5-1-codex-max
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
They claim it is faster, more capable and token-efficient and has better persistence on long tasks.
It scores 77.9% on SWE-bench-verified, 79.9% on SWE-Lancer-IC SWE and 58.1% on Terminal-Bench 2.0, all substantial gains over GPT-5.1-Codex.
It's triggering OpenAI to prepare for being high level in cybersecurity threats.
There's a 27 page system card. One could call this the secret ‘real’ GPT-5.1 that matters.
They even finally trained it to use Windows, somehow this is a new idea.
My goal is for my review of Opus 4.5 to start on Friday, as it takes a few days to sort through new releases. This post was written before Anthropic revealed Opus 4.5, and we don’t yet know how big an upgrade Opus 4.5 will prove to be. As always, try all your various options and choose what is best for you.
The Famous METR Graph
GPT-5.1-Codex-Max is a new high on the METR graph. METR's thread is here.
Prinz: METR (50% accuracy):
GPT-5.1-Codex-Max = 2 hours, 42 minutes
This is 25 minutes longer than GPT-5.
Samuel Albanie [...] ---Outline:(01:18) The Famous METR Graph(02:46) The System Card(03:43) Basic Disallowed Content(04:17) Sandbox(05:34) Mitigations For Harmful Tasks and Prompt Injections(06:13) Preparedness Framework(06:35) Biological and Chemical(07:50) Cybersecurity(11:58) AI Self-Improvement(14:27) Reactions ---
First published:
November 25th, 2025
Source:
https://www.lesswrong.com/posts/YMFYQpsY2MGbXKPtS/chatgpt-5-1-codex-max
---
Narrated by TYPE III AUDIO.
---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
More episodes of the podcast LessWrong posts by zvi
“Claude Codes” by Zvi
09/01/2026
“AI #150: While Claude Codes” by Zvi
08/01/2026
“Advancements In Self-Driving Cars” by Zvi
07/01/2026
“Dos Capital” by Zvi
05/01/2026
“Fertility Roundup #5: Causation” by Zvi
02/01/2026
“AI #149: 3” by Zvi
01/01/2026
“2025 Year in Review” by Zvi
31/12/2025
ZARZA We are Zarza, the prestigious firm behind major projects in information technology.