AI Control: Improving Safety Despite Intentional Subversion

15/12/2023 16 min

Listen "AI Control: Improving Safety Despite Intentional Subversion"

Episode Synopsis

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post, we summarize the paper and compare our methodology to the one used in other safety papers. The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.

Here's the abstract of the paper:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]

---

First published: December 13th, 2023

Source: https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

---

Narrated by TYPE III AUDIO.
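The abstract's example of "using models to review the outputs of other models" corresponds to protocols in the style of trusted monitoring: a more capable but untrusted model proposes work, a weaker trusted model flags suspicious outputs, and a limited human-audit budget is spent on the most suspicious cases. The sketch below is a minimal, hypothetical illustration of that idea, not code from the paper; the callables (`untrusted_solve`, `trusted_score`, `trusted_solve`) and the thresholds are stand-in assumptions.

```python
# Hypothetical sketch of a trusted-monitoring-style protocol (illustrative only).
# An untrusted model proposes a solution, a trusted model scores how suspicious
# it looks, and the most suspicious outputs are escalated for human auditing or
# replaced with the trusted model's own (safer, lower-quality) answer.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    task: str
    solution: str
    suspicion: float
    action: str  # "accept", "audit", or "use_trusted_fallback"


def trusted_monitoring(
    tasks: List[str],
    untrusted_solve: Callable[[str], str],       # capable but possibly scheming model
    trusted_score: Callable[[str, str], float],  # trusted model: 0.0 (benign) .. 1.0 (suspicious)
    trusted_solve: Callable[[str], str],         # trusted model used as a fallback solver
    audit_threshold: float = 0.9,                # illustrative: escalate the most suspicious outputs
    fallback_threshold: float = 0.6,             # illustrative: otherwise discard and use the fallback
) -> List[Decision]:
    decisions = []
    for task in tasks:
        solution = untrusted_solve(task)
        suspicion = trusted_score(task, solution)
        if suspicion >= audit_threshold:
            action = "audit"                     # spend limited human-audit budget here
        elif suspicion >= fallback_threshold:
            action = "use_trusted_fallback"
            solution = trusted_solve(task)       # safer but typically lower-quality answer
        else:
            action = "accept"
        decisions.append(Decision(task, solution, suspicion, action))
    return decisions
```

In the paper's framing, protocols like this are evaluated against a red team that tries to sneak subverted outputs past the monitor, so the interesting question is how well such a scheme holds up under intentional subversion rather than against accidental errors.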
