"When can we trust model evaluations?" bu evhub

09/08/2023 17 min
"When can we trust model evaluations?" bu evhub

Listen ""When can we trust model evaluations?" bu evhub"

Episode Synopsis

In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.Source:https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluationsNarrated for LessWrong by TYPE III AUDIO.Share feedback on this narration.[Curated Post] ✓
