Listen "“Tests of LLM introspection need to rule out causal bypassing” by Adam Morris, Dillon Plunkett"
Episode Synopsis
This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it's important, so we’re describing it here. There's been growing interest in testing whether LLMs can introspect on their internal states or processes. Like Lindsey, we take “introspection” to mean that a model can report on its internal states in a way that satisfies certain intuitive properties (e.g., the model's self-reports are accurate and not just inferences made by observing its own outputs). In this post, we focus on the property that Lindsey calls “grounding”. It can’t just be that the model happens to know true facts about itself; genuine introspection must causally depend on (i.e., be “grounded” in) the internal state or process that it describes. In other words, a model must report that it possesses State X or uses Algorithm Y because it actually has State X or uses Algorithm Y.[1] We focus on this criterion because it is relevant if we want to leverage LLM introspection for AI safety; self-reports that are causally dependent on the internal states they describe are more likely to retain their accuracy in novel [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
November 28th, 2025
Source:
https://www.lesswrong.com/posts/LD8yupMtE6btAE3R9/tests-of-llm-introspection-need-to-rule-out-causal-bypassing
---
Narrated by TYPE III AUDIO.