“Mainstream approach for alignment evals is a dead end” by Igor Ivanov

07/01/2026 10 min



Episode Synopsis

The problem of evaluation awareness

I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach to creating such evals is a dead end and should change.

When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic evaluated Claude Sonnet 4.5 on their alignment tests, they found that the rate of misalignment had dropped to almost 0%, compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, misaligned behavior reappeared. Because the model recognized that it was being evaluated, the evals were measuring evaluation awareness, not alignment.

The natural solution is to make evals more realistic: add context, remove artificial patterns, and attempt to use feedback from the evaluated model on what makes these evals look fake. I believe this is a fundamentally flawed approach. In this post I'll explain why this is the case, and describe some directions that might actually help.

The current paradigm and its problems

The most common approach for agentic alignment evals follows a pattern: Give a model [...]

---

Outline:

(00:11) The problem of evaluation awareness
(01:21) The current paradigm and its problems
(03:11) Why "just make it more realistic" is unlikely to work
(03:33) The problem with identifying eval features
(04:52) The "something's off" problem
(06:37) Capability evals are different
(07:10) Possible solutions
(07:17) Production evaluations
(08:22) Leveraging model internal values
(09:05) Modify real conversations
(09:29) Conclusion

---

First published:
January 6th, 2026

Source:
https://www.lesswrong.com/posts/GctsnCDxr73G4WiTq/mainstream-approach-for-alignment-evals-is-a-dead-end
---
Narrated by TYPE III AUDIO.

