“Mainstream approach for alignment evals is a dead end” by Igor Ivanov

07/01/2026 10 min

Episode Synopsis

The problem of evaluation awareness

I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach to creating such evals is a dead end and should change.

When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic evaluated Claude Sonnet 4.5 on their alignment tests, they found that the rate of misalignment had dropped to almost 0%, compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, misaligned behavior reappeared. Because the model recognized that it was being evaluated, the evals were measuring evaluation awareness, not alignment.

The natural solution is to make evals more realistic: add context, remove artificial patterns, and use feedback from the evaluated model on what makes these evals look fake. I believe this is a fundamentally flawed approach. In this post I'll explain why this is the case, and describe some directions that might actually help.

The current paradigm and its problems

The most common approach for agentic alignment evals follows a pattern: Give a model [...]

---
Outline:
(00:11) The problem of evaluation awareness
(01:21) The current paradigm and its problems
(03:11) Why "just make it more realistic" is unlikely to work
(03:33) The problem with identifying eval features
(04:52) The "something's off" problem
(06:37) Capability evals are different
(07:10) Possible solutions
(07:17) Production evaluations
(08:22) Leveraging model internal values
(09:05) Modify real conversations
(09:29) Conclusion
---
First published:
January 6th, 2026

Source:
https://www.lesswrong.com/posts/GctsnCDxr73G4WiTq/mainstream-approach-for-alignment-evals-is-a-dead-end
---
Narrated by TYPE III AUDIO.
